[ 
https://issues.apache.org/jira/browse/PIG-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4960:
------------------------------------
    Description: 
Sampling is not done correctly. Split is a special case because EOP is returned
after each record is processed. Earlier fixes addressed this (PIG-4480, etc.),
but it is still not done correctly.

   In the case of skewed join, skipInterval is applied to each record separately
instead of across all the records. So, apart from the first record, most of the
other records are skipped. Sampling is only slightly better if there is a
FLATTEN of a bag on the input record to Split, as there are then multiple
records to process.
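
The skewed join behaviour can be illustrated with a toy model. This is only a
sketch under stated assumptions, not Pig's sampler code: `sample_batches` and
its `reset_per_batch` flag are hypothetical names, and a "batch" stands for the
records Split hands over before returning EOP. Only the name skipInterval comes
from the description above.

```python
def sample_batches(batches, skip_interval, reset_per_batch):
    """Toy interval sampler (hypothetical, not Pig's implementation).

    Takes one sample, then skips `skip_interval` records before the next.
    With reset_per_batch=True the skip counter restarts for every batch,
    which models the interval being applied per record instead of across
    all the records.
    """
    samples = []
    skipped = skip_interval  # start "due" so the very first record is sampled
    for index, batch in enumerate(batches):
        if reset_per_batch and index > 0:
            skipped = 0  # modelled bug: counting starts over after each EOP
        for record in batch:
            if skipped >= skip_interval:
                samples.append(record)
                skipped = 0
            else:
                skipped += 1
    return samples


records = list(range(100))
one_record_batches = [[r] for r in records]                      # Split: EOP after each record
bag_batches = [records[i:i + 12] for i in range(0, 100, 12)]     # FLATTEN of a bag: several records

intended = sample_batches(one_record_batches, 9, reset_per_batch=False)
buggy = sample_batches(one_record_batches, 9, reset_per_batch=True)
with_bags = sample_batches(bag_batches, 9, reset_per_batch=True)
```

In this model `intended` is spread evenly over the input, `buggy` keeps only
the first record, and `with_bags` picks up a few more because a batch can be
longer than the interval.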

  In the case of order by, samples were being returned while they were still
being updated with new data. So the samples mostly contained records from the
first few hundred rows.
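
The order by symptom can be sketched in the same spirit. The description does
not say which sampling algorithm is used, so this assumes a reservoir-style
sampler purely for illustration. `reservoir_sample` and its `emit_early` flag
are hypothetical names, and `emit_early=True` models samples being handed out
before the whole input has been seen.

```python
import random


def reservoir_sample(rows, k, rng, emit_early=False):
    """Toy reservoir sampler (assumed technique, not Pig's implementation).

    Keeps k rows; each later row replaces a random slot with decreasing
    probability. With emit_early=True the reservoir is returned as soon as
    it is first filled, while it would still be updated by later rows.
    """
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
            if emit_early and len(reservoir) == k:
                return list(reservoir)  # modelled bug: only the first k rows
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


rows = range(10000)
early = reservoir_sample(rows, 100, random.Random(0), emit_early=True)
full = reservoir_sample(rows, 100, random.Random(0))
```

Here `early` contains nothing beyond the first 100 rows, whereas `full` draws
from the whole input because it is only returned once every row has been seen.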

  was:
Sampling is not done right. Split is a special case as EOP is returned after 
each record is processed. We did fixes for that before (PIG-4480, etc), but 
still it is not done right.  

   In case of skewed join, skipInterval is applied for each record instead of 
all the records. So except for the first record all the other records are 
mostly skipped. Sampling is slightly better if it is group by followed by 
skewed join on a different key as there is a bag of input to Split and there 
are multiple records.  

  In case of order by, samples were being returned even as they were being 
updated with new data. So samples mostly contained records from the first few 
hundreds of rows.  Sampling is slightly better in this case also if it is group 
by followed by order by on a different key.


> Split followed by order by/skewed join is skewed
> ------------------------------------------------
>
>                 Key: PIG-4960
>                 URL: https://issues.apache.org/jira/browse/PIG-4960
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0, 0.16.1
>
>
> Sampling is not done correctly. Split is a special case because EOP is
> returned after each record is processed. Earlier fixes addressed this
> (PIG-4480, etc.), but it is still not done correctly.
>    In the case of skewed join, skipInterval is applied to each record
> separately instead of across all the records. So, apart from the first
> record, most of the other records are skipped. Sampling is only slightly
> better if there is a FLATTEN of a bag on the input record to Split, as there
> are then multiple records to process.
>
>   In the case of order by, samples were being returned while they were still
> being updated with new data. So the samples mostly contained records from the
> first few hundred rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
