[ https://issues.apache.org/jira/browse/PIG-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-4960:
------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch committed to branch-0.16 and trunk. Thanks for the review, Daniel.

> Split followed by order by/skewed join is skewed
> ------------------------------------------------
>
>                 Key: PIG-4960
>                 URL: https://issues.apache.org/jira/browse/PIG-4960
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0, 0.16.1
>
>         Attachments: PIG-4960-1.patch
>
>
> Sampling is not done correctly. Split is a special case because EOP is
> returned after each record is processed. We fixed this before (PIG-4480,
> etc.), but it is still not done right.
>
> In the case of skewed join, skipInterval is applied to each record
> individually instead of across all the records. So except for the first
> record, nearly all other records are skipped. Sampling is somewhat better
> if there is a FLATTEN of a bag on the input record to Split, since there
> are then multiple records to process.
>
> In the case of order by, samples were being returned even as they were
> still being updated with new data. So the samples mostly contained records
> from the first few hundred rows.
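
The skewed join symptom described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Pig's sampler code: the function names, the `batches` representation (one inner list per run of records delivered before an EOP), and the exact skip logic are all hypothetical. It assumes the faulty behaviour amounts to restarting the skip counter for every batch, which with Split means for every single record.

```python
def sample_per_batch(batches, skip_interval):
    """Toy model of the bug: the skip counter restarts for each batch.

    Hypothetical illustration only; this is not Pig's implementation.
    """
    samples = []
    first_batch = True
    for batch in batches:
        # Only the very first batch starts with nothing to skip.
        to_skip = 0 if first_batch else skip_interval
        first_batch = False
        for record in batch:
            if to_skip > 0:
                to_skip -= 1
                continue
            samples.append(record)
            to_skip = skip_interval
    return samples


def sample_across_stream(batches, skip_interval):
    """Toy model of the intended behaviour: one counter for the whole input."""
    samples = []
    to_skip = 0
    for batch in batches:
        for record in batch:
            if to_skip > 0:
                to_skip -= 1
                continue
            samples.append(record)
            to_skip = skip_interval
    return samples


if __name__ == "__main__":
    # After a Split, each record arrives as its own batch.
    single_record_batches = [[r] for r in range(20)]
    print(sample_per_batch(single_record_batches, 4))      # [0]
    print(sample_across_stream(single_record_batches, 4))  # [0, 5, 10, 15]

    # With a FLATTEN of a bag, batches hold several records, so the
    # per-batch variant still manages to pick up some samples.
    flattened_batches = [list(range(i, i + 6)) for i in range(0, 18, 6)]
    print(sample_per_batch(flattened_batches, 4))
```

In this model the per-batch variant keeps only the first record when every batch holds one record, while the stream-wide counter spreads samples evenly over the input.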