[ https://issues.apache.org/jira/browse/PIG-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393203#comment-15393203 ]
Rohini Palaniswamy commented on PIG-4958:
-----------------------------------------
bq. Also this might overload the RM in case there are many such tasks.
There is only one such task for each order by statement in the Pig script, so
it should not cause overload.
bq. to pass consolidated stats information from the vertex manager to all the
tasks.
bq. I think this will be better achieved by making changes in Tez to make
specific information available via stats
I am looking for an approach that can be used now, until that feature support
is added in Tez.
bq. Tez can be configured to generate counters per output.
We don't enable that in our clusters, as it adds a lot of overhead to the AM.
The patch can definitely be refined to fetch per-output counters if they are
available. I was planning to address that when I do the patch for the skewed
join case.
bq. It's a private Tez API, a lot of load to the RM, and it's likely to be
extremely slow since AMs run with a single RPC thread
The private Tez API is a concern, but I am not worried about the load.
Currently the Pig client hits the AM with dagClient.getDAGStatus() every
second to check completion status, and we have not seen issues with the AM.
(A separate issue: a 1 sec interval is too short. It should be at least
5 secs.)
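
To illustrate the polling interval point, here is a minimal sketch of a client
loop with a configurable interval. It is not the Pig client code: StatusPoller,
pollUntilDone and the isCompleted supplier are hypothetical names, and the
supplier only stands in for a completion check built on
dagClient.getDAGStatus().

```java
import java.util.function.BooleanSupplier;

final class StatusPoller {
    /**
     * Calls isCompleted repeatedly, sleeping intervalMillis between calls,
     * until it returns true. Returns the number of status calls made.
     * isCompleted is a hypothetical stand-in for a check against the AM.
     */
    static int pollUntilDone(BooleanSupplier isCompleted, long intervalMillis) {
        int calls = 0;
        while (true) {
            calls++;
            if (isCompleted.getAsBoolean()) {
                return calls;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop polling early.
                Thread.currentThread().interrupt();
                return calls;
            }
        }
    }
}
```

With this shape, moving from a 1 sec to a 5 sec interval cuts the number of
status calls for a long-running DAG to roughly a fifth, at the cost of noticing
completion up to 5 secs late.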
> Tez autoparallelism estimation for order by is higher than mapreduce
> --------------------------------------------------------------------
>
> Key: PIG-4958
> URL: https://issues.apache.org/jira/browse/PIG-4958
> Project: Pig
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.17.0
>
> Attachments: PIG-4958-withoutsecurity.patch
>
>
> The input size is calculated from the size of the samples in memory. The
> size in memory is usually 4x the serialized size or more. MapReduce
> estimates the number of reducers based on the serialized size.
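
A rough sketch of why the 4x size difference matters, assuming a simplified
size-based estimator (input bytes divided by bytes per reducer, capped at a
maximum). This is not Pig's actual estimator code. The two constants mirror the
documented defaults of pig.exec.reducers.bytes.per.reducer and
pig.exec.reducers.max, and the 10 GB input is a made-up figure.

```java
final class ReducerEstimate {
    // Assumed to match the defaults of pig.exec.reducers.bytes.per.reducer
    // and pig.exec.reducers.max.
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    static final int MAX_REDUCERS = 999;

    /** Simplified model: ceil(inputBytes / BYTES_PER_REDUCER), clamped to [1, MAX_REDUCERS]. */
    static int estimate(long inputBytes) {
        long n = (inputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER;
        return (int) Math.max(1, Math.min(MAX_REDUCERS, n));
    }

    public static void main(String[] args) {
        long serialized = 10L * 1_000_000_000L; // hypothetical 10 GB serialized input
        long inMemory = 4 * serialized;         // same data measured at 4x in-memory size
        System.out.println(estimate(serialized)); // prints 10
        System.out.println(estimate(inMemory));   // prints 40
    }
}
```

Under this model, the same data gets four times as many reducers when its size
is measured in memory instead of serialized.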