[ https://issues.apache.org/jira/browse/TEZ-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194133#comment-15194133 ]
Siddharth Seth commented on TEZ-3164:
-------------------------------------

Big +1 for doing this. An external script could be used for such diagnostics, but Tez, MR, etc. will likely already have a lot of this information from running jobs.

> Surface error histograms from the AM
> ------------------------------------
>
>                 Key: TEZ-3164
>                 URL: https://issues.apache.org/jira/browse/TEZ-3164
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>
> Job tasks are constantly probing the cluster, so if there are issues in
> the cluster, jobs would be the first to notice them. If we can surface
> these observations to the user, then we could quickly identify cluster
> issues.
> Let's say a set of bad machines got added to the cluster and tasks started
> seeing shuffle errors from those machines. This can slow down or hang the
> job. If the AM can surface increased error counts from source and
> destination machines, then that could pinpoint the bad machines, rather
> than having to arrive at those machines from first principles and log
> searching.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)