[ https://issues.apache.org/jira/browse/SPARK-21493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095466#comment-16095466 ]
Sean Owen commented on SPARK-21493:
-----------------------------------

How is this different from SPARK-21334 that you opened?

> Add more metrics to External Shuffle Service
> --------------------------------------------
>
>                 Key: SPARK-21493
>                 URL: https://issues.apache.org/jira/browse/SPARK-21493
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>   Affects Versions: 2.2.0
>            Reporter: Raajay Viswanathan
>            Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current set of metrics in the external shuffle service are fairly
> limited. To debug failure of the shuffle service, it would be good to get
> more information regarding the state of the shuffle service. As a first cut,
> the following metrics seem important:
> 1. The amount of heap memory used by the External Shuffle Service process
> 2. The amount of direct buffer (off-heap) memory allocated to External
> Shuffle Service. In the external shuffle service, Netty uses off-heap memory.
> Monitoring its usage can help in allocating appropriate resources and can
> also be used to raise alarms when the allocated memory exceeds a threshold.
> 3. The queue length in Netty event loops. Chunk Fetch Requests (or) Open
> Block requests can be dropped as a result of Netty queue overflows (resulting
> in FetchFailure). Having hard data on queue size can help in attributing
> cause of failures.
> Please let me know of other metrics (from Shuffle Service perspective) that
> would be good to have.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
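
For illustration only, here is a minimal sketch of how metrics 1 and 2 from the issue description (heap usage and direct buffer usage) could be sampled inside a JVM with the standard java.lang.management API. The class name ShuffleServiceMemoryProbe and its method names are hypothetical and are not part of Spark. The sketch does not cover metric 3, since reading Netty event loop queue lengths requires Netty itself. Note also that the JVM's "direct" buffer pool may not account for all off-heap memory Netty uses, depending on the Netty version and allocator configuration.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.List;

// Hypothetical helper, not part of Spark: samples JVM memory figures
// that correspond to metrics 1 and 2 in the issue description.
class ShuffleServiceMemoryProbe {

    // Metric 1: heap memory currently used by this JVM, in bytes.
    static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Metric 2: memory held by the JVM's "direct" buffer pool, in bytes.
    // Returns -1 if no pool with that name is exposed by this JVM.
    static long directBufferUsedBytes() {
        List<BufferPoolMXBean> pools =
            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes):   " + heapUsedBytes());
        System.out.println("direct used (bytes): " + directBufferUsedBytes());
    }
}
```

A real integration would presumably register these values as gauges in whatever metrics system the shuffle service already reports to, and poll them periodically rather than printing them once.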