[ https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Baohe Zhang updated SPARK-35865:
--------------------------------
    Attachment: openblock.png

> Remove await (syncMode) in ChunkFetchRequestHandler
> ---------------------------------------------------
>
>                 Key: SPARK-35865
>                 URL: https://issues.apache.org/jira/browse/SPARK-35865
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.4.8, 3.1.2
>            Reporter: Baohe Zhang
>            Priority: Major
>         Attachments: openblock-compare.png, openblock.png
>
>
> SPARK-24355 introduced syncMode to mitigate SASL timeouts by throttling the
> maximum number of threads that send responses to chunk fetch requests. But it
> causes severe performance degradation because it reduces the throughput of
> handling chunk fetch requests. SPARK-30623 made the choice between async and
> sync mode configurable and made async mode the default.
> SPARK-30512 uses a dedicated boss event loop to mitigate the SASL timeout
> issue, and we rarely see SASL timeouts with async mode in our production
> clusters today.
> A few days ago we accidentally turned on sync mode on one cluster and
> observed severe shuffle performance degradation. As a result, we benchmarked
> async mode against sync mode, and *we suggest removing sync mode from the
> code base*, as it does not seem to provide any benefit today. We would like
> to share the benchmark results and hear opinions from the community.
>
> benchmark on job run time (sync mode is 2x - 3x slower):
> YARN cluster setup: 6 nodes, 18 executors; each executor has 1 core and 3 GB
> memory, and each node manager has a 1 GB heap.
> shuffle stages: 5 GB of shuffle data (400M key-value records), 1000 map tasks
> and 1000 reduce tasks.
> results: to shuffle-read 5 GB of data, async mode takes 2-3 mins and sync
> mode takes 6 mins.
>
> benchmark on metrics of the external shuffle service:
> YARN cluster setup: 4 nodes in total. I set 2 nodes to async mode and 2 nodes
> to sync mode, shuffling 2.5 GB of data.
> results: in openblockrequestslatencymillis_ratemean and some other metrics,
> the values on the nodes in sync mode are 3x - 4x higher than on the nodes in
> async mode. I attached some screenshots of the metrics.
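
The job run time figures quoted above can be turned into rough throughput numbers. The following is only a back-of-the-envelope sketch in Python: the data size and timings are copied from the description, while the reading of "5GB" as 5 * 2**30 bytes and all function and variable names are assumptions made for illustration. The run times cover the whole job, so the results are aggregate averages, not a measurement of the shuffle service alone.

```python
# Figures copied from the benchmark in the issue description.
SHUFFLE_BYTES = 5 * 1024**3      # 5 GB of shuffle data (assumed to mean GiB)
ASYNC_MINUTES = (2.0, 3.0)       # async mode: reported as 2-3 mins
SYNC_MINUTES = 6.0               # sync mode: reported as 6 mins


def throughput_mib_per_s(total_bytes: int, minutes: float) -> float:
    """Average shuffle-read throughput in MiB/s over the reported run time."""
    return total_bytes / 1024**2 / (minutes * 60)


def slowdown(sync_minutes: float, async_minutes: float) -> float:
    """How many times longer the sync-mode run took than the async-mode run."""
    return sync_minutes / async_minutes


async_fast, async_slow = ASYNC_MINUTES

# Slowdown range implied by the timings: (vs. slowest async run, vs. fastest).
slowdown_range = (slowdown(SYNC_MINUTES, async_slow),
                  slowdown(SYNC_MINUTES, async_fast))

sync_mib_s = throughput_mib_per_s(SHUFFLE_BYTES, SYNC_MINUTES)
async_mib_s = (throughput_mib_per_s(SHUFFLE_BYTES, async_slow),
               throughput_mib_per_s(SHUFFLE_BYTES, async_fast))

print(f"sync slowdown: {slowdown_range[0]:.1f}x - {slowdown_range[1]:.1f}x")
print(f"sync throughput: {sync_mib_s:.1f} MiB/s")
print(f"async throughput: {async_mib_s[0]:.1f} - {async_mib_s[1]:.1f} MiB/s")
```

Under these assumptions the timings are consistent with the stated 2x - 3x slowdown: about 14 MiB/s aggregate in sync mode versus roughly 28 to 43 MiB/s in async mode.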