[ 
https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401687#comment-13401687
 ] 

Jie Li commented on PIG-2661:
-----------------------------

An interesting problem:

Previously, for order-by, Pig forced any preceding pipeline to finish and 
write its output to disk first, and then sampled and sorted that output, so the 
sampler saw the same data that was sorted. Now we want to merge the preceding 
map-only pipeline into both the sampler and the order-by. The sampler will sample 
the data before that pipeline and pass the samples through the pipeline 
to generate the partition file. Consider the query:

{code}
a = load 'data' as (x,y);
b = filter a by udf(x,y);
c = foreach b generate udf(x,y);
d = order c by x;
{code}

Here a->b->c is the pipeline before order-by. Previously, Pig wrote c to 
disk first, and the sampler then drew its samples from c. Now we want 
to avoid writing c to disk, so the sampler will load a to get samples and 
pass them through b and c to generate the partition file. Here b and c can be 
projections, filters, or any other non-blocking operators.
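
To make the last step concrete, here is a minimal Python sketch of how sampled sort keys could be turned into range-partition boundaries. It is an illustration only, not Pig's actual code: the function names `partition_boundaries` and `partition_for` are hypothetical, and it assumes the partition file simply holds evenly spaced quantile cut points taken from the sorted sample.

```python
import bisect

def partition_boundaries(sample_keys, num_partitions):
    """Pick num_partitions - 1 cut points from the sorted sample (assumed quantile scheme)."""
    keys = sorted(sample_keys)
    if not keys or num_partitions < 2:
        return []
    return [keys[(i * len(keys)) // num_partitions] for i in range(1, num_partitions)]

def partition_for(key, boundaries):
    """Return the index of the range partition that a sort key falls into."""
    return bisect.bisect_right(boundaries, key)

# Hypothetical sampled keys, i.e. samples of a after passing through b and c
boundaries = partition_boundaries(range(100), 4)
print(boundaries)
print(partition_for(10, boundaries), partition_for(99, boundaries))
```

If the sample reflects the distribution of c, each partition should receive roughly the same number of records during the sort.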

One concern is whether the new way of sampling would still capture the 
distribution of the data to be sorted.

||What we want||What we have now||What we'll have||
|Distribution(a->b->c)|Distribution(Sample(a->b->c))|Distribution(Sample(a)->b->c)|

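Sample keeps the original distribution, and because b and c are non-blocking 
operators that handle each record independently, it should not matter whether 
sampling happens before or after them. So the three distributions in the table 
should be equivalent.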
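
The equivalence can be checked with a small Python simulation. This is a sketch under stated assumptions, not Pig's sampler: it assumes the sampler makes an independent keep-or-drop decision per input record, and the filter predicate, the projection, and the 10% rate are all made up for illustration. With the same per-record decisions, sampling before or after the pipeline yields exactly the same records.

```python
import random

def make_input(n, seed=0):
    """Build n hypothetical records of the form (record_id, x, y)."""
    rng = random.Random(seed)
    return [(i, rng.randint(0, 999), rng.randint(0, 999)) for i in range(n)]

def pipeline(records):
    """Stand-in for a->b->c: a per-record filter, then a per-record projection."""
    out = []
    for rec_id, x, y in records:
        if (x + y) % 3 != 0:                 # b: hypothetical filter predicate
            out.append((rec_id, 2 * x + y))  # c: hypothetical projection to a sort key
    return out

def sample(records, flips):
    """Keep a record only if its precomputed coin flip came up true."""
    return [r for r in records if flips[r[0]]]

a = make_input(10_000)
flip_rng = random.Random(1)
flips = [flip_rng.random() < 0.1 for _ in a]  # one independent coin flip per input record

pipeline_then_sample = sample(pipeline(a), flips)  # Sample(a->b->c)
sample_then_pipeline = pipeline(sample(a, flips))  # Sample(a)->b->c

print(pipeline_then_sample == sample_then_pipeline)
```

The argument relies on b and c being per-record operators; a blocking operator that combines several records would not commute with sampling in this way.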

Another concern is performance. With the patch, the sampler will do a full 
scan of the table before the filter, which might be slower than before if the 
filter is very selective. This might be acceptable considering that the sampler 
parses only a small percentage of the data. I will run some benchmarks.

                
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
>                 Key: PIG-2661
>                 URL: https://issues.apache.org/jira/browse/PIG-2661
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Jie Li
>            Assignee: Jie Li
>         Attachments: PIG-2661.0.patch, PIG-2661.1.patch
>
>
> See 
> https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
