Ye Zhou created SPARK-25634:
-------------------------------

             Summary: New Metrics in External Shuffle Service to help identify 
abusing application
                 Key: SPARK-25634
                 URL: https://issues.apache.org/jira/browse/SPARK-25634
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 2.4.0
            Reporter: Ye Zhou


We run Spark on YARN, and deploy Spark external shuffle service as part of YARN 
NM aux service. External Shuffle Service is shared by all Spark applications. 
SPARK-24355 enables the threads reservation to handle non-ChunkFetchRequest. 
SPARK-21501 limits the memory usage for Guava Cache to avoid OOM in shuffle 
service which could crash NodeManager. But still some application may generate 
a large amount of shuffle blocks which could heavily decrease the performance 
on some shuffle servers. When this abusing behavior happens, it might further 
decreases the overall performance for other applications if they happen to use 
the same shuffle servers. We have been seeing issues like this in our cluster, 
but there is no way for us to figure out which application is abusing shuffle 
service.

SPARK-18364 has enabled expose out shuffle service metrics to Hadoop Metrics 
System. It is better if we can have the following metrics and also metrics 
divided by applicationID:

1. *shuffle server on-heap memory consumption for caching shuffle indexes*

2. *breakdown of shuffle indexes caching memory consumption by local executors*

We can generate metrics when 
ExternalShuffleBlockHandler-->getSortBasedShuffleBlockData, which will trigger 
the Cache load. We can roughly be able to get the metrics from the 
shuffleindexfile size when putting into the cache and moved out from the cache.

3. *shuffle server load for shuffle block fetch requests*

4. *breakdown of shuffle server block fetch requests load by remote executors*

We can generate metrics in ExternalShuffleBlockHandler-->handleMessage when a 
new OpenBlocks message is received.

Open discussion for more metrics that could potentially influence the overall 
shuffle service performance. 

We can print out those metrics which are divided by applicationIDs in log, 
since it is hard to define fixed key and use numerical value for this kind of 
metrics. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to