Raajay Viswanathan created SPARK-21493:
------------------------------------------

             Summary: Add more metrics to External Shuffle Service
                 Key: SPARK-21493
                 URL: https://issues.apache.org/jira/browse/SPARK-21493
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Raajay Viswanathan
            Priority: Minor


The current set of metrics in the external shuffle service is fairly limited.
To debug failures of the shuffle service, it would be good to get more
information regarding the state of the shuffle service. As a first cut, the
following metrics seem important:

1. The amount of heap memory used by the External Shuffle Service process
2. The amount of direct buffer (off-heap) memory allocated to the External
Shuffle Service. In the external shuffle service, Netty uses off-heap memory.
Monitoring its usage can help in allocating appropriate resources and can also
be used to raise alarms when the allocated memory exceeds a threshold.
3. The queue length in Netty event loops. Chunk Fetch Requests or Open Block
requests can be dropped as a result of Netty queue overflows (resulting in
FetchFailure). Having hard data on queue size can help in attributing the
cause of failures.
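
As a rough illustration of items 1 and 2, the sketch below reads heap usage
and the JVM's "direct" buffer pool through the standard java.lang.management
beans. The class and method names (ShuffleServiceMemoryProbe, heapUsedBytes,
directBufferBytes) are hypothetical and are not part of Spark. It also assumes
Netty's off-heap allocations are visible in the JVM "direct" pool; depending
on the Netty version and configuration, some allocations may bypass that
counter, so the figure may be a lower bound. Item 3 would need Netty's own
APIs and is not covered here.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.List;

// Hypothetical helper, not an existing Spark class.
public class ShuffleServiceMemoryProbe {

    /** Bytes of heap currently used by this JVM process. */
    public static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /**
     * Bytes currently held in the JVM's "direct" buffer pool,
     * or -1 if the JVM does not expose such a pool.
     */
    public static long directBufferBytes() {
        List<BufferPoolMXBean> pools =
            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println("heap.used.bytes=" + heapUsedBytes());
        System.out.println("direct.used.bytes=" + directBufferBytes());
    }
}
```

Values like these could be registered as gauges in whatever metrics registry
the shuffle service already exposes, and an alarm threshold applied to the
direct-memory gauge.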

Please let me know of other metrics (from the shuffle service's perspective)
that would be good to have.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
