[jira] [Commented] (PIG-1846) optimize queries like - count distinct users for each gender

Thejas M Nair (JIRA) Fri, 10 Jun 2011 14:32:55 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047476#comment-13047476
 ]


Thejas M Nair commented on PIG-1846:
------------------------------------

bq. yeah I was just using short-hand with the distinct thing, and assumed you 
would know what I meant
I didn't realize the mistake when I wrote the example. But the shorthand is 
more readable, so I have created PIG-2117 to discuss supporting that syntax. 

bq.  Regarding two distincts – we can run the initial group-bys twice, and join?
Yes, that will work. 

If the UDF FUNC is algebraic and FUNC.Initial() returns something that is 
smaller than its argument (e.g., COUNT), a further optimization would be:

{code}
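-- Apply the Initial stage of ALGFUNC to c4 before any grouping,
-- so that less data is carried into the first group-by.
in = FOREACH in GENERATE *, ALGFUNC$Initial(c4) as init;
-- First group-by includes c3, so each group holds one distinct c3 value.
gby_dist = GROUP in BY (c1, c2, c3) PARALLEL 100;
res_dist = FOREACH gby_dist GENERATE 
  group.c1, group.c2, FUNC.Initial(c3),
  ALGFUNC$Intermed(in.init) as intermed;
-- Second group-by on the original keys combines the partial results.
gby = GROUP res_dist BY (c1, c2) PARALLEL 100;
res = FOREACH gby GENERATE
  FLATTEN(group) as (c1, c2),
  FUNC2(res_dist.c3),
  ALGFUNC2(res_dist.intermed);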
{code}
Here FUNC2 is like the ALGFUNC2 described earlier: FUNC2.Initial is the same 
as FUNC.Intermed.
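
To make the rewrite above concrete, here is a rough Python simulation of it, 
not Pig code and not the planned implementation. It assumes FUNC and ALGFUNC 
are both COUNT, and the helper names (count_initial, count_intermed, 
count_final, naive, two_stage) and the sample rows are made up for 
illustration. It only checks that the two-stage plan gives the same answer as 
the single group-by.

```python
from collections import defaultdict

# Hypothetical stand-ins for the three stages of an algebraic COUNT.
def count_initial(value):
    return 1  # one input value becomes a partial count of 1

def count_intermed(partials):
    return sum(partials)  # merge partial counts

def count_final(partials):
    return sum(partials)  # merge partial counts into the final count

def naive(rows):
    """Single group-by on (c1, c2): COUNT(distinct c3) and COUNT(c4)."""
    groups = defaultdict(list)
    for c1, c2, c3, c4 in rows:
        groups[(c1, c2)].append((c3, c4))
    return {
        key: (len({c3 for c3, _ in bag}), len(bag))
        for key, bag in groups.items()
    }

def two_stage(rows):
    """Group by (c1, c2, c3) first, then by (c1, c2)."""
    # Before grouping: replace c4 by its smaller Initial value.
    with_init = [(c1, c2, c3, count_initial(c4)) for c1, c2, c3, c4 in rows]

    # Stage 1: each (c1, c2, c3) group corresponds to one distinct c3.
    stage1 = defaultdict(list)
    for c1, c2, c3, init in with_init:
        stage1[(c1, c2, c3)].append(init)
    res_dist = [
        (c1, c2, count_initial(c3), count_intermed(inits))
        for (c1, c2, c3), inits in stage1.items()
    ]

    # Stage 2: combine the partial results per (c1, c2).
    stage2 = defaultdict(list)
    for c1, c2, dist_partial, intermed in res_dist:
        stage2[(c1, c2)].append((dist_partial, intermed))
    return {
        key: (
            count_final([d for d, _ in bag]),
            count_final([i for _, i in bag]),
        )
        for key, bag in stage2.items()
    }

# Made-up sample rows: (c1, c2, c3, c4)
rows = [
    ("a", "x", "u1", 10),
    ("a", "x", "u1", 11),
    ("a", "x", "u2", 12),
    ("b", "y", "u3", 13),
    ("b", "y", "u3", 14),
    ("b", "y", "u3", 15),
]

assert two_stage(rows) == naive(rows)
```

The point of the first stage is that the work is spread over as many keys as 
there are distinct (c1, c2, c3) combinations, and the second stage only sees 
one small partial result per combination.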


> optimize queries like - count distinct users for each gender
> ------------------------------------------------------------
>
>                 Key: PIG-1846
>                 URL: https://issues.apache.org/jira/browse/PIG-1846
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Thejas M Nair
>             Fix For: 0.10
>
>
> The Pig group operation does not usually have to deal with skew on the 
> group-by keys if the foreach statement that works on the results of the group 
> has only algebraic functions on the bags. But for some queries like the 
> following, skew can be a problem:
> {code}
> user_data = load 'file' as (user, gender, age);
> user_group_gender = group user_data by gender parallel 100;
> dist_users_per_gender = foreach user_group_gender 
>                         { 
>                              dist_user = distinct user_data.user; 
>                              generate group as gender, COUNT(dist_user) as user_count;
>                         }
> {code}
> Since there are only 2 distinct values of the group-by key, only 2 reducers 
> will actually get used in the current implementation, i.e., you can't get 
> better performance by adding more reducers.
> A similar problem arises when the data is skewed on the group key. With the 
> current implementation, another problem is that Pig and MR have to deal with 
> records with extremely large bags that hold the large number of distinct user 
> names, which results in high memory utilization and having to spill the bags 
> to disk.
> The query plan should be modified to handle the skew in such cases and make 
> use of more reducers.

