[ 
https://issues.apache.org/jira/browse/PIG-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4960:
------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch committed to branch-0.16 and trunk. Thanks for the review, Daniel.

> Split followed by order by/skewed join is skewed
> ------------------------------------------------
>
>                 Key: PIG-4960
>                 URL: https://issues.apache.org/jira/browse/PIG-4960
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0, 0.16.1
>
>         Attachments: PIG-4960-1.patch
>
>
> Sampling is not done correctly when a Split precedes an order by or skewed
> join. Split is a special case because EOP is returned after each record is
> processed. Earlier fixes addressed this (PIG-4480, etc.), but the behavior is
> still wrong.
>
> In the case of skewed join, skipInterval is applied per record instead of
> across all the records, so all records except the first are mostly skipped.
> Sampling is slightly better if there is a FLATTEN of a bag on the input
> record to Split, as there are then multiple records to process.
>
> In the case of order by, samples were being returned while they were still
> being updated with new data, so the samples mostly contained records from the
> first few hundred rows.
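
The order by symptom can be illustrated with a toy model. The sketch below is
plain Python, not Pig code, and it assumes the sampler behaves like a standard
reservoir sampler (Algorithm R); the function name reservoir_sample and the
snapshot_at parameter are hypothetical, introduced only for this example. It
compares a reservoir read right after it first fills with the reservoir read
after all input has been consumed.

```python
import random


def reservoir_sample(rows, k, rng, snapshot_at=None):
    """Toy Algorithm R reservoir sampler (illustration only, not Pig's code).

    Returns (final, early). `final` is the reservoir after all rows have
    been seen. `early` is a copy of the reservoir taken right after the row
    at index `snapshot_at` was processed, or None if no snapshot was asked for.
    """
    reservoir = []
    early = None
    for i, row in enumerate(rows):
        if i < k:
            # The first k rows always fill the reservoir.
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
        if snapshot_at is not None and i == snapshot_at:
            early = list(reservoir)  # premature read of the samples
    return reservoir, early


rng = random.Random(0)
n, k = 10_000, 100
final, early = reservoir_sample(range(n), k, rng, snapshot_at=k - 1)

# A premature read sees only the leading rows of the input.
print(max(early))
# Reading after all input gives later rows a chance to be sampled.
print(max(final))
```

In this model, a reservoir handed downstream before the input is exhausted can
only describe the rows seen so far, which matches the reported bias toward the
first few hundred rows. Whether the committed patch addresses it this way is
not stated in this message.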



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
