[
https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401687#comment-13401687
]
Jie Li commented on PIG-2661:
-----------------------------
An interesting problem:
Previously, for order-by, Pig would force any preceding pipeline to finish and
write its output to disk first, and then sample that data and sort it, so the
sampler saw the same data that would be sorted. Now we want to merge the
preceding map-only pipeline into both the sampler and the order-by. The sampler
will sample the data before that pipeline and pass the samples through the
pipeline to generate the partition file. Consider the query:
{code}
a = load 'data' as (x,y);
b = filter a by udf(x,y);
c = foreach b generate udf(x,y);
d = order c by x;
{code}
Here a->b->c is the pipeline before order-by. Previously Pig would write c to
disk first, and the sampler would then draw samples from c; now we want to
avoid writing c to disk, so the sampler will load a to get samples and pass
them through b and c to generate the partition file. Here b and c can be
projections, filters, or any other non-blocking operators.
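A rough sketch of the new sampling path, written in Python rather than Pig for
illustration only. The function names, the 5% sample rate, and the simple
quantile-based boundaries are all assumptions made for this example; Pig's
actual sampler and partitioner are more involved.

```python
import random

def udf_filter(x, y):
    # Hypothetical stand-in for the filter UDF in b.
    return (x + y) % 2 == 0

def udf_project(x, y):
    # Hypothetical stand-in for the foreach UDF in c; returns the sort key.
    return x * 10 + y

def pipeline(records):
    """Apply b (filter) and then c (foreach) one record at a time."""
    for x, y in records:
        if udf_filter(x, y):
            yield udf_project(x, y)

def sample(records, rate, rng):
    """Keep each record independently with probability `rate`."""
    return [r for r in records if rng.random() < rate]

def partition_boundaries(keys, num_partitions):
    """Pick num_partitions - 1 evenly spaced cut points from the sorted keys."""
    keys = sorted(keys)
    if not keys:
        return []
    return [keys[(i * len(keys)) // num_partitions]
            for i in range(1, num_partitions)]

rng = random.Random(42)
a = [(rng.randrange(1000), rng.randrange(1000)) for _ in range(100_000)]

# New plan: sample the raw input a, then push only the samples through b and c.
sampled_a = sample(a, 0.05, rng)
boundaries = partition_boundaries(list(pipeline(sampled_a)), 4)
```

The point of the sketch is that c is never materialized: only the sampled
records of a go through the pipeline before the cut points are computed.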
One concern is whether the new way of sampling would still capture the
distribution of the data to be sorted.
||What we want||What we have now||What we'll have||
|Distribution(a->b->c)|Distribution(Sample(a->b->c))|Distribution(Sample(a)->b->c)|
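Assuming Sample picks records at random and b and c process each record
independently (as non-blocking operators do), sampling before or after the
pipeline should give the same distribution in expectation, so the three
distributions in the table should be equivalent.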
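A small simulation can make this argument concrete. This is a sketch under
assumed conditions, not a test of Pig itself: the input is synthetic, the
filter and projection are made-up per-record functions, and Sample is modeled
as independent per-record sampling at an assumed 5% rate.

```python
import random

def pipeline(records):
    # Hypothetical b and c: keep x > 0.5, then emit x * x as the sort key.
    return [x * x for x in records if x > 0.5]

def sample(records, rate, rng):
    # Keep each record independently with probability `rate`.
    return [r for r in records if rng.random() < rate]

def quantile(keys, q):
    # Simple empirical quantile: element at the q-th position of the sorted keys.
    keys = sorted(keys)
    return keys[int(q * (len(keys) - 1))]

rng = random.Random(7)
a = [rng.random() for _ in range(200_000)]

full = pipeline(a)                                 # Distribution(a->b->c)
sample_after = sample(pipeline(a), 0.05, rng)      # Distribution(Sample(a->b->c))
sample_before = pipeline(sample(a, 0.05, rng))     # Distribution(Sample(a)->b->c)

medians = {
    "full": quantile(full, 0.5),
    "sample_after": quantile(sample_after, 0.5),
    "sample_before": quantile(sample_before, 0.5),
}
```

With these assumptions the three medians should agree up to sampling noise.
The argument would not carry over to operators that depend on other records,
which is why the pipeline is restricted to non-blocking operators.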
Another concern is performance. With the patch, the sampler will do a full
scan of the table before the filter, which might be slower than before if the
filter is very selective. This might be acceptable, considering that the
sampler parses only a small percentage of the data. I will run some benchmarks.
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
> Key: PIG-2661
> URL: https://issues.apache.org/jira/browse/PIG-2661
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Jie Li
> Assignee: Jie Li
> Attachments: PIG-2661.0.patch, PIG-2661.1.patch
>
>
> See
> https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155