[ 
https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401828#comment-13401828
 ] 

Jie Li commented on PIG-2661:
-----------------------------

Some benchmark results using the 1GB TPC-H lineitem data:

||query||trunk||this patch||
||load-orderby-store| 1m41s (load) + 53s (sample) + 3m11s (orderby) | 38s (sample) + 3m27s (orderby) |
||load-orderby-filter-store| 41s (load) + 32s (sample) + 35s (orderby) | 38s (sample) + 50s (orderby) |

Note that the filter is very selective, but we didn't see a slowdown of the
sample job. The slight slowdown of the orderby job might result from different
serialization. In both queries, we save one entire load job.
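
To make the net effect easier to see, here is a small Python sketch that sums the per-job timings from the table above. The helper `to_seconds` and the `timings` layout are just for illustration; the totals are plain sums of the reported job times and ignore any overhead between jobs.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration such as '1m41s' or '53s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


# Per-job timings copied from the benchmark table in this comment.
timings = {
    "load-orderby-store": {
        "trunk": ["1m41s", "53s", "3m11s"],
        "patch": ["38s", "3m27s"],
    },
    "load-orderby-filter-store": {
        "trunk": ["41s", "32s", "35s"],
        "patch": ["38s", "50s"],
    },
}

# Sum the job times for each query and each variant.
totals = {
    query: {
        variant: sum(to_seconds(d) for d in jobs)
        for variant, jobs in variants.items()
    }
    for query, variants in timings.items()
}

for query, t in totals.items():
    saved = t["trunk"] - t["patch"]
    print(f"{query}: trunk={t['trunk']}s patch={t['patch']}s saved={saved}s")
```

Summed this way, the patch comes out ahead in both queries even though the orderby job itself is slightly slower.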

But another issue just came to mind: though the distribution won't change,
the number of samples might change after the pipeline. If the pipeline
decreases the number of records (e.g. filter/limit/sample), then we'll have
fewer samples at the end, but we also have a smaller order-by which doesn't
need many samples. If the pipeline increases the number of records (e.g.
flatten/stream), then we may end up with too many samples at the end, which is
likely to hurt performance. Therefore let's just disable the sample
optimization if we find these "exploding" pipeline operators. (What else
besides flatten/stream?)
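
The proposed rule can be sketched roughly as follows. This is a minimal Python illustration, not Pig's planner code: the operator names as plain strings, the `EXPLODING_OPS` set, and the function `can_apply_sample_optimization` are all hypothetical, and the set holds only the two operators named above.

```python
# Hypothetical operator names; Pig's real plan classes are not modeled here.
# Only the two "exploding" operators mentioned in this comment are listed.
EXPLODING_OPS = frozenset({"flatten", "stream"})


def can_apply_sample_optimization(pipeline):
    """Return True if no operator in the pipeline may increase #records.

    `pipeline` is assumed to be the sequence of operator names that sit
    between the load and the order-by.
    """
    return not any(op.lower() in EXPLODING_OPS for op in pipeline)


# A record-reducing pipeline keeps the optimization enabled.
print(can_apply_sample_optimization(["filter", "limit"]))
# An exploding operator anywhere in the pipeline disables it.
print(can_apply_sample_optimization(["filter", "flatten"]))
```

Keeping the exploding operators in one set means that any further operators identified later only need to be added in one place.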
                
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
>                 Key: PIG-2661
>                 URL: https://issues.apache.org/jira/browse/PIG-2661
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Jie Li
>            Assignee: Jie Li
>         Attachments: PIG-2661.0.patch, PIG-2661.1.patch
>
>
> See 
> https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
