[ 
https://issues.apache.org/jira/browse/PIG-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391176#comment-15391176
 ] 

Rohini Palaniswamy commented on PIG-4958:
-----------------------------------------

Solving this by reading the OUTPUT_BYTES counter of the sampler vertices and 
using that to estimate the input size. If a sampler vertex has multiple outputs, 
the OUTPUT_BYTES value might be too high, because the counter is not available 
per output. So taking min(OUTPUT_BYTES, sizeEstimatedFromSamples).
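
A minimal sketch of the min() rule described above, for illustration only. The
class name, method name, and the fallback for a missing counter are assumptions
and are not taken from the Pig code base.

```java
// Hypothetical helper; not Pig's actual implementation.
final class SamplerInputSizeEstimate {

    /**
     * @param outputBytes              OUTPUT_BYTES counter of the sampler vertex
     *                                 (covers all of its outputs, so it may be too high)
     * @param sizeEstimatedFromSamples input size extrapolated from the in-memory
     *                                 size of the samples (typically too high as well)
     * @return the smaller of the two estimates
     */
    static long estimate(long outputBytes, long sizeEstimatedFromSamples) {
        if (outputBytes <= 0) {
            // Assumption: if the counter is missing or zero, fall back to
            // the sample-based estimate instead of returning 0.
            return sizeEstimatedFromSamples;
        }
        return Math.min(outputBytes, sizeEstimatedFromSamples);
    }
}
```

Because both inputs tend to over-estimate, taking the smaller one keeps the
result closer to the serialized size without undershooting it.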

Skewed join has the same issue, but the fix is slightly more complicated because 
we need to include the OUTPUT_BYTES of the right input as well. Currently the 
size is estimated based only on the left input. That works out OK because the 
in-memory size estimate, which is 4x or more of the left input, will always be 
greater than (left input + right input OUTPUT_BYTES). Will create a separate 
jira for that.
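
The arithmetic behind the skewed join remark can be sketched as below. This is
only an illustration: the class, the method, and the explicit ratio parameter
are hypothetical, and the 4x figure is the rough ratio quoted in this issue.

```java
// Hypothetical check; not part of Pig.
final class SkewedJoinEstimateCheck {

    /**
     * Returns true when the in-memory estimate derived from the left input
     * alone is at least as large as the serialized size of both inputs.
     *
     * @param leftBytes               serialized size of the left input
     * @param rightBytes              OUTPUT_BYTES of the right input
     * @param memoryToSerializedRatio assumed in-memory/serialized ratio (e.g. 4)
     */
    static boolean memoryEstimateCoversBothInputs(long leftBytes, long rightBytes,
                                                  long memoryToSerializedRatio) {
        long memoryEstimate = leftBytes * memoryToSerializedRatio;
        return memoryEstimate >= leftBytes + rightBytes;
    }
}
```

With a ratio of exactly 4, the check passes as long as the right input is no
more than three times the size of the left input; a higher ratio leaves more
headroom.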

> Tez autoparallelism estimation for order by is higher than mapreduce
> --------------------------------------------------------------------
>
>                 Key: PIG-4958
>                 URL: https://issues.apache.org/jira/browse/PIG-4958
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>
>   The input size is calculated from the size of the samples in memory. Size 
> in memory is usually 4x or more than the serialized size. Mapreduce estimates 
> the number of reducers based on serialized size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
