[jira] [Commented] (PIG-4958) Tez autoparallelism estimation for order by is higher than mapreduce

Rohini Palaniswamy (JIRA) Mon, 25 Jul 2016 21:31:43 -0700

    [ https://issues.apache.org/jira/browse/PIG-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393203#comment-15393203 ]


Rohini Palaniswamy commented on PIG-4958:
-----------------------------------------

bq. Also this might overload the RM in case there are many such tasks.
  There is only one such task for each order by statement in the Pig script, 
so it should not cause overload.

bq. to pass consolidated stats information from the vertex manager to all the 
tasks.
bq. I think this will be better achieved by making changes in Tez to make 
specific information available via stats
  I am looking for an approach that can be used now, until feature support is 
added in Tez.

bq. Tez can be configured to generate counters per output.
  We don't enable that in our clusters as it adds a lot of overhead to the AM. 
The patch can definitely be refined to fetch the per-output counter if it is 
available. I was planning to address that when I did the patch for the skewed 
join case.

bq. It's a private Tez API, a lot of load to the RM, and it's likely to be 
extremely slow since AMs run with a single RPC thread 
  The private Tez API is a concern, but I am not worried about the load. 
Currently the Pig client hits the AM with dagClient.getDAGStatus() every second 
to check completion status, and we have not seen issues with the AM. (A 
separate issue: a 1 second interval is too short. It should be at least 5 
seconds.)
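
To make the polling point concrete, here is a minimal sketch of a status loop 
with a configurable interval. It is not the Pig client code: the class name, 
the pollUntilDone helper and the BooleanSupplier standing in for a 
dagClient.getDAGStatus() completion check are all hypothetical, and the 5 
second constant is just the value suggested above.

```java
import java.util.function.BooleanSupplier;

public class StatusPollSketch {
    // Suggested interval from the comment above (5 secs instead of 1 sec).
    static final long POLL_INTERVAL_MILLIS = 5000L;

    /**
     * Calls isDone until it returns true or maxPolls checks have been made,
     * sleeping intervalMillis between checks. Returns the number of status
     * calls issued. isDone is a hypothetical stand-in for asking the AM
     * whether the DAG has completed.
     */
    static int pollUntilDone(BooleanSupplier isDone, long intervalMillis, int maxPolls) {
        int calls = 0;
        while (calls < maxPolls) {
            calls++;
            if (isDone.getAsBoolean()) {
                break;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop polling.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        int[] checks = {0};
        // Fake DAG that reports completion on the third check; a 10 ms
        // interval is used here only so the demo finishes quickly.
        int calls = pollUntilDone(() -> ++checks[0] >= 3, 10L, 100);
        System.out.println(calls); // prints 3
    }
}
```

For a DAG that runs for a fixed time, the number of status calls the AM 
receives scales inversely with the interval, so moving from 1 second to 5 
seconds cuts the calls from one client to roughly a fifth.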






> Tez autoparallelism estimation for order by is higher than mapreduce
> --------------------------------------------------------------------
>
>                 Key: PIG-4958
>                 URL: https://issues.apache.org/jira/browse/PIG-4958
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>         Attachments: PIG-4958-withoutsecurity.patch
>
>
>   The input size is calculated from the size of the samples in memory. The 
> in-memory size is usually 4x the serialized size or more. Mapreduce estimates 
> the number of reducers based on the serialized size.
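
A rough worked example of the effect described in the issue summary, assuming 
the reducer count is the input size divided by a bytes-per-reducer setting and 
capped at a maximum. The class and method names are hypothetical, the 10 GB 
input is made up, and the constants are assumed to match the defaults of 
pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max.

```java
public class ReducerEstimateSketch {
    // Assumed defaults for pig.exec.reducers.bytes.per.reducer and
    // pig.exec.reducers.max; check your own configuration.
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    static final int MAX_REDUCERS = 999;

    /** ceil(inputBytes / bytesPerReducer), clamped to [1, maxReducers]. */
    static int estimateReducers(long inputBytes, long bytesPerReducer, int maxReducers) {
        int reducers = (int) Math.ceil((double) inputBytes / bytesPerReducer);
        return Math.min(maxReducers, Math.max(1, reducers));
    }

    public static void main(String[] args) {
        long serializedBytes = 10_000_000_000L;   // hypothetical 10 GB on disk
        long inMemoryBytes = 4 * serializedBytes; // 4x inflation, per the issue

        // Estimate from serialized size (what mapreduce bases it on).
        System.out.println(estimateReducers(serializedBytes, BYTES_PER_REDUCER, MAX_REDUCERS)); // prints 10
        // Estimate from in-memory size (what the Tez path was effectively using).
        System.out.println(estimateReducers(inMemoryBytes, BYTES_PER_REDUCER, MAX_REDUCERS));   // prints 40
    }
}
```

Under these assumptions the in-memory figure yields four times as many 
reducers for the same data, which is the gap the issue title describes.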



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
