[ https://issues.apache.org/jira/browse/SPARK-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-27773:
------------------------------------

    Assignee: Apache Spark

> Add shuffle service metric for number of exceptions caught in TransportChannelHandler
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-27773
>                 URL: https://issues.apache.org/jira/browse/SPARK-27773
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.4.3
>            Reporter: Steven Rand
>            Assignee: Apache Spark
>            Priority: Minor
>
> The health of the external shuffle service is currently difficult to monitor.
> At least for the YARN shuffle service, the only current indication of health
> is whether or not the shuffle service threads are running in the NodeManager.
> However, we've seen that clients can sometimes experience elevated failure
> rates on requests to the shuffle service even when those threads are running.
> It would be helpful to have some indication of how often requests to the
> shuffle service are failing, as this could be monitored, alerted on, etc.
>
> One suggestion (implemented in the PR I'll attach to this ticket) is to add a
> metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of
> how many times we called {{TransportChannelHandler#exceptionCaught}}. I think
> that this gives us the insight into request failure rates that we're
> currently missing, but obviously I'm open to alternatives as well if people
> have other ideas.
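
The suggestion in the description (count every call to exceptionCaught and expose the total as a shuffle service metric) can be sketched roughly as below. This is only an illustration, not the code from the attached PR: ShuffleMetricsSketch and ChannelHandlerSketch are hypothetical stand-ins for ExternalShuffleBlockHandler.ShuffleMetrics and TransportChannelHandler, and a plain AtomicLong is assumed in place of whatever metric type the real implementation registers.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the shuffle service's metrics holder.
// An AtomicLong keeps the sketch dependency-free and thread-safe.
final class ShuffleMetricsSketch {
    private final AtomicLong caughtExceptions = new AtomicLong();

    // Called once per exception seen by a channel handler.
    void recordCaughtException() {
        caughtExceptions.incrementAndGet();
    }

    // Value a metrics reporter would poll and alert on.
    long caughtExceptionCount() {
        return caughtExceptions.get();
    }
}

// Hypothetical stand-in for the channel handler; only the counting
// part of exceptionCaught is modelled here.
final class ChannelHandlerSketch {
    private final ShuffleMetricsSketch metrics;

    ChannelHandlerSketch(ShuffleMetricsSketch metrics) {
        this.metrics = metrics;
    }

    void exceptionCaught(Throwable cause) {
        // Count first, so the metric is updated even if later
        // handling (logging, closing the channel) itself fails.
        metrics.recordCaughtException();
    }
}
```

Sharing one metrics object across all handlers gives a single service-wide count. A monitoring system would then alert on the rate of change of that count rather than on its absolute value.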