[ https://issues.apache.org/jira/browse/PIG-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047267#comment-13047267 ]
Thejas M Nair commented on PIG-1846:
------------------------------------

The optimization proposed above is applicable only when the distinct is on a single column or a single set of columns.

For example, it is applicable for -

res = FOREACH gby GENERATE group.c1, group.c2, FUNC(distinct in.c3);
res = FOREACH gby GENERATE group.c1, group.c2, FUNC1(distinct in.c3), FUNC2(distinct in.c3); -- distinct on same column used in two functions
res = FOREACH gby GENERATE group.c1, group.c2, FUNC(distinct in.(c3,c4)); -- distinct on multiple columns
res = FOREACH gby GENERATE group.c1, group.c2, FUNC1(distinct in.(c3,c4)), FUNC2(distinct in.(c3,c4)); -- distinct on same set of multiple columns, used in two functions

It is not applicable for -

res = FOREACH gby GENERATE group.c1, group.c2, FUNC(distinct in.c3), FUNC(distinct in.c4); -- the two UDFs have distinct on two different columns

FYI, the examples here use unsupported syntax -
res = FOREACH gby GENERATE group.c1, group.c2, FUNC(DISTINCT in.c3);
should actually be -
res = FOREACH gby { dist_c3 = DISTINCT in.c3; GENERATE group.c1, group.c2, FUNC(dist_c3); }

> optimize queries like - count distinct users for each gender
> ------------------------------------------------------------
>
>                 Key: PIG-1846
>                 URL: https://issues.apache.org/jira/browse/PIG-1846
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Thejas M Nair
>             Fix For: 0.10
>
>
> The pig group operation does not usually have to deal with skew on the
> group-by keys if the foreach statement that works on the results of group has
> only algebraic functions on the bags.
> But for some queries like the following, skew can be a problem -
> {code}
> user_data = load 'file' as (user, gender, age);
> user_group_gender = group user_data by gender parallel 100;
> dist_users_per_gender = foreach user_group_gender
>                         {
>                              dist_user = distinct user_data.user;
>                              generate group as gender, COUNT(dist_user) as user_count;
>                         }
> {code}
> Since there are only 2 distinct values of the group-by key, only 2 reducers
> will actually get used in the current implementation, i.e., you can't get
> better performance by adding more reducers.
> A similar problem occurs when the data is skewed on the group key. With the
> current implementation, another problem is that Pig and MR have to deal with
> records with extremely large bags that hold the large number of distinct user
> names, which results in high memory utilization and having to spill the bags
> to disk.
> The query plan should be modified to handle the skew in such cases and make
> use of more reducers.
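
To make the skew argument concrete, here is a small Python simulation of one way a plan could spread the work over more reducers: partition on the (gender, user) pair rather than on gender alone, de-duplicate inside each partition, and then sum small partial counts per gender. This is only a sketch under stated assumptions, not the plan implemented for PIG-1846. The function name count_distinct_two_stage and the parameter num_reducers are hypothetical, and skipping null users is an assumption meant to mirror COUNT ignoring tuples whose first field is null.

```python
from collections import defaultdict


def count_distinct_two_stage(records, num_reducers=100):
    """Simulate a two-stage 'count distinct users per gender' plan.

    records: iterable of (user, gender, age) tuples, as in the query above.
    num_reducers: hypothetical number of stage-1 reducers.
    """
    # Stage 1: route each (gender, user) pair by the hash of the whole pair.
    # Identical pairs always land in the same partition, so de-duplicating
    # within a partition is equivalent to a global distinct.
    partitions = [set() for _ in range(num_reducers)]
    for user, gender, _age in records:
        if user is None:
            continue  # assumption: null users are not counted
        partitions[hash((gender, user)) % num_reducers].add((gender, user))

    # Each simulated reducer emits a small partial count per gender
    # instead of one very large bag of user names.
    partials = []
    for part in partitions:
        counts = defaultdict(int)
        for gender, _user in part:
            counts[gender] += 1
        partials.extend(counts.items())

    # Stage 2: sum the partial counts for each gender.
    totals = defaultdict(int)
    for gender, n in partials:
        totals[gender] += n
    return dict(totals)
```

Because the partition key includes the user, the number of busy reducers in stage 1 is no longer bounded by the number of distinct genders, and stage 2 only has to add up at most num_reducers small integers per gender.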