[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Attachment: openblock.png

> Remove await (syncMode) in ChunkFetchRequestHandler
> ---
>
> Key: SPARK-35865
> URL: https://issues.apache.org/jira/browse/SPARK-35865
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.4.8, 3.1.2
> Reporter: Baohe Zhang
> Priority: Major
> Attachments: openblock-compare.png, openblock.png
>
>
> SPARK-24355 introduced syncMode to mitigate SASL timeouts by throttling the 
> maximum number of threads that send responses to chunk fetch requests. 
> However, it causes severe performance degradation because it reduces the 
> throughput of chunk fetch request handling. SPARK-30623 made the choice 
> between async and sync mode configurable, with async mode as the default. 
> SPARK-30512 uses a dedicated boss event loop to mitigate the SASL timeout 
> issue, and we rarely see SASL timeouts with async mode in our production 
> clusters today. 
> A few days ago we accidentally turned on sync mode on one cluster and 
> observed severe shuffle performance degradation. As a result, we benchmarked 
> async mode against sync mode, and *we suggest removing sync mode from the 
> code base*, as it does not seem to provide any benefit today. We would like 
> to share the benchmark results and hear the community's opinion.
>  
> Benchmark on job run time (sync mode is 2x - 3x slower):
> YARN cluster setup: 6 nodes, 18 executors, each executor has 1 core and 3 GB 
> memory, each node manager has a 1 GB heap.
> Shuffle stages: 5 GB of shuffle data (400M key-value records), 1000 map 
> tasks and 1000 reduce tasks.
> Results: to shuffle-read 5 GB of data, async mode takes 2-3 minutes and sync 
> mode takes 6 minutes.
>  
> Benchmark on external shuffle service metrics:
> YARN cluster setup: 4 nodes in total, 2 nodes set to async mode and 2 nodes 
> set to sync mode, shuffling 2.5 GB of data.
> Results: in openblockrequestslatencymillis_ratemean and some other metrics, 
> the values on the sync-mode nodes are 3x - 4x higher than on the async-mode 
> nodes. I attached some screenshots of the metrics.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Attachment: openblock-compare.png







[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Description: 
SPARK-24355 introduced syncMode to mitigate SASL timeouts by throttling the 
maximum number of threads that send responses to chunk fetch requests. 
However, it causes severe performance degradation because it reduces the 
throughput of chunk fetch request handling. SPARK-30623 made the choice 
between async and sync mode configurable, with async mode as the default. 
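
To illustrate the throughput argument in the paragraph above, here is a toy 
model in Python. It is only a sketch under simplifying assumptions and is not 
taken from the Spark code base: the function names and all numbers are 
hypothetical, each handler thread is assumed to spend a fixed time preparing a 
response, and in async mode response writes are assumed to overlap freely once 
they have been handed off.

```python
import math


def sync_makespan_ms(num_requests, handler_threads, handle_ms, write_ms):
    """Sync mode: a handler thread waits for each write before the next request."""
    rounds = math.ceil(num_requests / handler_threads)
    return rounds * (handle_ms + write_ms)


def async_makespan_ms(num_requests, handler_threads, handle_ms, write_ms):
    """Async mode: a handler thread hands the write off and moves on at once."""
    rounds = math.ceil(num_requests / handler_threads)
    # Only the final batch of writes is not overlapped with further handling.
    return rounds * handle_ms + write_ms


if __name__ == "__main__":
    # Hypothetical workload, unrelated to the benchmark numbers in this issue.
    args = dict(num_requests=1000, handler_threads=4, handle_ms=1, write_ms=5)
    print("sync :", sync_makespan_ms(**args), "ms")
    print("async:", async_makespan_ms(**args), "ms")
```

In this model the gap between the two modes grows with the write latency, 
because sync mode pays that latency once per request on every handler thread. 
The model says nothing about SASL timeouts, which were the original motivation 
for sync mode.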

SPARK-30512 uses a dedicated boss event loop to mitigate the SASL timeout 
issue, and we rarely see SASL timeouts with async mode in our production 
clusters today. 

A few days ago we accidentally turned on sync mode on one cluster and observed 
severe shuffle performance degradation. As a result, we benchmarked async mode 
against sync mode, and *we suggest removing sync mode from the code base*, as 
it does not seem to provide any benefit today. We would like to share the 
benchmark results and hear the community's opinion.

 

Benchmark on job run time (sync mode is 2x - 3x slower):
YARN cluster setup: 6 nodes, 18 executors, each executor has 1 core and 3 GB 
memory, each node manager has a 1 GB heap.

Shuffle stages: 5 GB of shuffle data (400M key-value records), 1000 map tasks 
and 1000 reduce tasks.

Results: to shuffle-read 5 GB of data, async mode takes 2-3 minutes and sync 
mode takes 6 minutes.

 

Benchmark on external shuffle service metrics:
YARN cluster setup: 4 nodes in total, 2 nodes set to async mode and 2 nodes 
set to sync mode, shuffling 2.5 GB of data.

Results: in openblockrequestslatencymillis_ratemean and some other metrics, the 
values on the sync-mode nodes are 3x - 4x higher than on the async-mode nodes. 
I attached some screenshots of the metrics.

!openblock.png!

!openblock-compare.png!  




[jira] [Updated] (SPARK-35865) Remove await (syncMode) in ChunkFetchRequestHandler

2021-06-23 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-35865:

Description: 
SPARK-24355 introduced syncMode to mitigate SASL timeouts by throttling the 
maximum number of threads that send responses to chunk fetch requests. 
However, it causes severe performance degradation because it reduces the 
throughput of chunk fetch request handling. SPARK-30623 made the choice 
between async and sync mode configurable, with async mode as the default. 

SPARK-30512 uses a dedicated boss event loop to mitigate the SASL timeout 
issue, and we rarely see SASL timeouts with async mode in our production 
clusters today. 

A few days ago we accidentally turned on sync mode on one cluster and observed 
severe shuffle performance degradation. As a result, we benchmarked async mode 
against sync mode, and *we suggest removing sync mode from the code base*, as 
it does not seem to provide any benefit today. We would like to share the 
benchmark results and hear the community's opinion.

 

Benchmark on job run time (sync mode is 2x - 3x slower):
YARN cluster setup: 6 nodes, 18 executors, each executor has 1 core and 3 GB 
memory, each node manager has a 1 GB heap.

Shuffle stages: 5 GB of shuffle data (400M key-value records), 1000 map tasks 
and 1000 reduce tasks.

Results: to shuffle-read 5 GB of data, async mode takes 2-3 minutes and sync 
mode takes 6 minutes.

 

Benchmark on external shuffle service metrics:
YARN cluster setup: 4 nodes in total, 2 nodes set to async mode and 2 nodes 
set to sync mode, shuffling 2.5 GB of data.

Results: in openblockrequestslatencymillis_ratemean and some other metrics, the 
values on the sync-mode nodes are 3x - 4x higher than on the async-mode nodes. 
I attached some screenshots of the metrics.


