[ 
https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403606#comment-13403606
 ] 

Jie Li commented on PIG-2661:
-----------------------------

Here are some numbers showing why we want to disable merging the pipeline 
into the sample job when the pipeline contains a flatten or stream:

Query: 
{code}
A = LOAD '$input/group' USING PigStorage('|') AS (a:int, b:{});
B = foreach A generate a, flatten(b);
ret = order B by $1; 
STORE ret INTO '$output/out';
{code}

Note the flatten in the query. See the attached PIG-2661.plan.txt for the 
query plan produced when the pipeline is merged.

Test data:
1GB of data, grouped into three bags.

Result:
||merge||don't merge||
|sample (17m) + orderby (14m)|pipeline (11m) + sample (1m26s) + orderby (5m)|
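
As a rough check on the table, the per-phase timings can be added up. This is 
only a sketch: it assumes the jobs in each plan run one after another, so the 
total is the plain sum of the phases, and the variable and function names are 
illustrative, not anything from Pig.

```python
# Phase timings copied from the Result table, converted to seconds.
# Assumption: the jobs in each plan run sequentially, so totals are plain sums.
merge = {"sample": 17 * 60, "orderby": 14 * 60}
no_merge = {"pipeline": 11 * 60, "sample": 1 * 60 + 26, "orderby": 5 * 60}


def total_seconds(phases):
    """Sum the per-phase durations of one plan."""
    return sum(phases.values())


def fmt(seconds):
    """Render a duration in the same XmYYs style as the table."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs:02d}s"


merge_total = total_seconds(merge)
no_merge_total = total_seconds(no_merge)
speedup = merge_total / no_merge_total

print("merge:      ", fmt(merge_total))
print("don't merge:", fmt(no_merge_total))
print(f"ratio:       {speedup:.2f}x")
```

Under that assumption the merged plan takes 31m in total and the unmerged plan 
17m26s, so not merging is roughly 1.8x faster on this data set even though it 
runs one more job.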

As the numbers show, merging the pipeline into the sample job makes the query 
much slower, for several reasons:
1) the sample job samples all three bags, which together contain the entire 
1GB of data;
2) the sample job requires a reduce phase to aggregate the sample information;
3) the orderby job needs to re-parse the input data.

With 10GB of data we would expect the difference to be even more pronounced, 
since all 10GB would go through the one reducer of the sample job.
                
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
>                 Key: PIG-2661
>                 URL: https://issues.apache.org/jira/browse/PIG-2661
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Jie Li
>            Assignee: Jie Li
>         Attachments: PIG-2661.0.patch, PIG-2661.1.patch, PIG-2661.2.patch, 
> PIG-2661.plan.txt
>
>
> See 
> https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
