optimize queries like - count distinct users for each gender
------------------------------------------------------------

                 Key: PIG-1846
                 URL: https://issues.apache.org/jira/browse/PIG-1846
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Thejas M Nair


The pig group operation does not usually have to deal with skew on the group-by 
keys if the foreach statement that works on the results of group has only 
algebraic functions on the bags. But for some queries like the following, skew 
can be a problem -

{code}
user_data = load 'file' as (user, gender, age);
user_group_gender = group user_data by gender parallel 100;
dist_users_per_gender = foreach user_group_gender 
                        { 
                             dist_user = distinct user_data.user; 
                             generate group as gender, COUNT(dist_user) as 
user_count;
                        }
{code}

Since there are only 2 distinct values of the group-by key, only 2 reducers 
will actually get used in current implementation. ie, you can't get better 
performance by adding more reducers.
Similar problem is there when the data is skewed on the group key. With current 
implementation, another problem is that pig and MR has to deal with records 
with extremely large bags that have the large number of distinct user names, 
which results in high memory utilization and having to spill the bags to disk.

The query plan should be modified to handle the skew in such cases and make use 
of more reducers.




-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to