[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

Ashutosh Chauhan (JIRA) Tue, 10 Nov 2009 10:09:57 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775969#action_12775969
 ]


Ashutosh Chauhan commented on PIG-1038:
---------------------------------------

Another place where Hadoop's secondary sort is useful in Pig is sorting the
index entries for Merge Join. In the indexing job of Merge Join, index entries
sampled from the map tasks are grouped in a single reduce task, where they are
sorted before being written to disk. Currently Pig does the sorting, but
Hadoop's secondary sort could be used instead. This may not yield much of a
performance gain, since the index is small in any case, but it may be a good
test case for the secondary key optimization. Whether it applies depends on how
you are discovering the pattern, as I asked in my previous question. If there
is a POSort immediately following a POPackage or POJoinPackage in the reducer,
and some other conditions are met, we can apply the secondary key sort
optimization.
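
The pattern check described above could be sketched roughly as follows. This
is only an illustration in Python, not Pig's plan API: the reduce plan is
modeled as a plain list of operator names, the function name
find_secondary_sort_candidates is hypothetical, and the unspecified "other
conditions" are not checked.

```python
# Hypothetical model: a reduce-side plan is an ordered list of operator
# names. This is NOT Pig's actual physical plan API.
PACKAGE_OPS = {"POPackage", "POJoinPackage"}


def find_secondary_sort_candidates(reduce_plan):
    """Return positions of POSort operators that directly follow a package operator.

    Only the adjacency part of the pattern is checked; any further
    conditions the optimizer would need are left out of this sketch.
    """
    return [
        i
        for i in range(1, len(reduce_plan))
        if reduce_plan[i] == "POSort" and reduce_plan[i - 1] in PACKAGE_OPS
    ]
```

A POSort that is separated from the package operator by another operator (for
example a POForEach) is not reported, since the sketch only looks at immediate
successors.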

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>         Attachments: PIG-1038-1.patch, PIG-1038-2.patch
>
>
> If a nested foreach plan contains sort/distinct, it is possible to use
> Hadoop's secondary sort instead of SortedDataBag and DistinctDataBag to
> optimize the query.
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct 
> D" to a special version of distinct, which does not do the sorting.
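
The idea behind both examples can be simulated in a few lines of Python. This
is a minimal sketch under stated assumptions, not Pig or Hadoop code: records
are (group key, secondary key) pairs standing in for ($0, $1), the shuffle is
imitated with an in-memory sort, and all function names are hypothetical. For
brevity the nested bag holds only the $1 values rather than whole tuples.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_secondary_sort(records):
    """Imitate a shuffle that sorts on (group key, secondary key)
    but groups on the group key only."""
    ordered = sorted(records, key=itemgetter(0, 1))
    for group, rows in groupby(ordered, key=itemgetter(0)):
        # Values arrive at the "reducer" already ordered by the secondary key.
        yield group, [row[1] for row in rows]


def distinct_presorted(values):
    """Drop duplicates from an already sorted sequence by comparing
    each value with its predecessor; no sorting is done here."""
    out = []
    for value in values:
        if not out or out[-1] != value:
            out.append(value)
    return out


def nested_order(records):
    """Eg1: the nested 'order A by $1' becomes a no-op after the shuffle."""
    return list(shuffle_with_secondary_sort(records))


def nested_distinct(records):
    """Eg2: the nested distinct only has to skip adjacent duplicates."""
    return [
        (group, distinct_presorted(values))
        for group, values in shuffle_with_secondary_sort(records)
    ]
```

In both cases the per-group work in the reducer is a single pass over values
that are already in order, which is what lets the sorted or distinct bag be
dropped.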

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
