[ 
https://issues.apache.org/jira/browse/SPARK-25634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25634:
----------------------------------
    Affects Version/s:     (was: 2.4.0)
                       3.0.0

> New Metrics in External Shuffle Service to help identify abusing application
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25634
>                 URL: https://issues.apache.org/jira/browse/SPARK-25634
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 3.0.0
>            Reporter: Ye Zhou
>            Priority: Minor
>
> We run Spark on YARN and deploy the Spark external shuffle service as a 
> YARN NM aux service. The external shuffle service is shared by all Spark 
> applications. SPARK-24355 reserves threads to handle 
> non-ChunkFetchRequest messages. SPARK-21501 limits the memory usage of the 
> Guava cache to avoid an OOM in the shuffle service, which could crash the 
> NodeManager. However, some applications may still generate a large number of 
> shuffle blocks, which can heavily degrade performance on some shuffle 
> servers. When this abusive behavior happens, it may further degrade overall 
> performance for other applications that happen to use the same shuffle 
> servers. We have been seeing issues like this in our cluster, but there is no 
> way for us to figure out which application is abusing the shuffle service.
> SPARK-18364 exposes shuffle service metrics to the Hadoop Metrics System. It 
> would be better to also have the following metrics, along with the same 
> metrics broken down by applicationID:
> 1. *shuffle server on-heap memory consumption for caching shuffle indexes*
> 2. *breakdown of shuffle indexes caching memory consumption by local 
> executors*
> We can generate these metrics in 
> ExternalShuffleBlockHandler-->getSortBasedShuffleBlockData, which triggers 
> the cache load. We can roughly derive the metrics from the shuffle index file 
> size, adding it when an entry is put into the cache and subtracting it when 
> the entry is evicted from the cache.
> 3. *shuffle server load for shuffle block fetch requests*
> 4. *breakdown of shuffle server block fetch requests load by remote executors*
> We can generate metrics in ExternalShuffleBlockHandler-->handleMessage when a 
> new OpenBlocks message is received.
> We are open to discussing more metrics that could potentially influence the 
> overall shuffle service performance. 
> We can print the metrics that are broken down by applicationID to the log, 
> since it is hard to define a fixed key and use a numerical value for this 
> kind of metric. 
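The per-application bookkeeping described in the issue can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class `PerAppShuffleMetrics` and all of its methods are hypothetical names, not part of Spark, and the sketch assumes the caller already knows the application ID and the index file size at the points where the cache loads or evicts an entry and where an OpenBlocks message arrives.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-application counters; an illustration, not Spark code. */
class PerAppShuffleMetrics {
  // appId -> approximate bytes of shuffle index data currently held in the cache
  private final Map<String, LongAdder> cachedIndexBytes = new ConcurrentHashMap<>();
  // appId -> total number of blocks requested through OpenBlocks messages
  private final Map<String, LongAdder> requestedBlocks = new ConcurrentHashMap<>();

  /** Assumed to be called when an index file of the given size is loaded into the cache. */
  void onIndexCached(String appId, long indexFileBytes) {
    cachedIndexBytes.computeIfAbsent(appId, k -> new LongAdder()).add(indexFileBytes);
  }

  /** Assumed to be called when the same entry is evicted from the cache. */
  void onIndexEvicted(String appId, long indexFileBytes) {
    cachedIndexBytes.computeIfAbsent(appId, k -> new LongAdder()).add(-indexFileBytes);
  }

  /** Assumed to be called once per received OpenBlocks message. */
  void onOpenBlocks(String appId, int numBlocks) {
    requestedBlocks.computeIfAbsent(appId, k -> new LongAdder()).add(numBlocks);
  }

  long getCachedIndexBytes(String appId) {
    return read(cachedIndexBytes, appId);
  }

  long getRequestedBlocks(String appId) {
    return read(requestedBlocks, appId);
  }

  /** Renders all applications, sorted by ID, as one line suitable for a log message. */
  String toLogLine() {
    Set<String> appIds = new TreeSet<>(cachedIndexBytes.keySet());
    appIds.addAll(requestedBlocks.keySet());
    StringBuilder sb = new StringBuilder();
    for (String appId : appIds) {
      if (sb.length() > 0) {
        sb.append("; ");
      }
      sb.append(appId)
        .append(" cachedIndexBytes=").append(getCachedIndexBytes(appId))
        .append(" requestedBlocks=").append(getRequestedBlocks(appId));
    }
    return sb.toString();
  }

  // Returns 0 for applications that have not been seen by the given counter map.
  private static long read(Map<String, LongAdder> counters, String appId) {
    LongAdder adder = counters.get(appId);
    return adder == null ? 0L : adder.sum();
  }
}
```

Keying the maps by application ID rather than by a fixed metric name matches the point about logging: the set of keys is open-ended, so a periodic log line is easier to produce than a fixed gauge. The per-executor breakdown mentioned in items 2 and 4 could presumably use the same structure with a composite key of application ID and executor ID.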



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
