Rohini Palaniswamy created PIG-4449:
---------------------------------------

             Summary: Optimize the case of Order by + Limit in nested foreach
                 Key: PIG-4449
                 URL: https://issues.apache.org/jira/browse/PIG-4449
             Project: Pig
          Issue Type: Improvement
            Reporter: Rohini Palaniswamy


This is a very frequently used pattern:

{code}
grouped_data_set = group data_set by id;

capped_data_set = foreach grouped_data_set
{
  ordered = order data_set by timestamp desc;
  capped = limit ordered $num;
  generate flatten(capped);
};
{code}

But this performs very poorly, with a lot of spills, when a key in the 
group-by has millions of rows.  It can be easily optimized by pushing the limit 
into the InternalSortedBag so that it maintains only $num records at any time 
and avoids memory pressure.
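
As a rough illustration of the idea (not Pig's actual InternalSortedBag code), 
a bounded top-N container could look like the sketch below.  The class name 
BoundedTopN and its methods are hypothetical and chosen only for this example; 
it assumes the limit is known up front and that a comparator describing the 
ORDER BY is available.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Hypothetical sketch: keeps only the first `limit` elements under `order`.
 * This is NOT Pig's InternalSortedBag; names are illustrative assumptions.
 */
final class BoundedTopN<T> {
    private final int limit;
    private final Comparator<T> order;   // desired output order, e.g. timestamp desc
    private final PriorityQueue<T> heap; // head = retained element that sorts last

    BoundedTopN(int limit, Comparator<T> order) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        this.limit = limit;
        this.order = order;
        // Reversing the comparator puts the worst retained element at the head,
        // so it can be evicted cheaply when a better one arrives.
        this.heap = new PriorityQueue<>(limit, order.reversed());
    }

    /** Adds an item, evicting the current worst one if the bound is exceeded. */
    void add(T item) {
        if (heap.size() < limit) {
            heap.add(item);
        } else if (order.compare(item, heap.peek()) < 0) {
            heap.poll();
            heap.add(item);
        }
        // Otherwise the item sorts after everything retained and is dropped.
    }

    /** Returns the retained items in the requested order. */
    List<T> sorted() {
        List<T> out = new ArrayList<>(heap);
        out.sort(order);
        return out;
    }

    int size() {
        return heap.size();
    }
}
```

With a structure like this, memory per key stays proportional to $num rather 
than to the number of rows in the group, so nothing needs to be spilled for 
the sort.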



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)