Zhongwei Zhu created SPARK-45057:
------------------------------------
Summary: Deadlock caused by rdd replication level of 2
Key: SPARK-45057
URL: https://issues.apache.org/jira/browse/SPARK-45057
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.1
Reporter: Zhongwei Zhu
When 2 tasks try to compute the same RDD with a replication level of 2 while
running on only 2 executors, a deadlock will happen.
||Time||Exe 1 (Task Thread 1)||Exe 1 (Shuffle Server Thread 2)||Exe 2 (Task Thread 3)||Exe 2 (Shuffle Server Thread 4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by shuffle server thread 4)| | | |
|T3| | | |Received UploadBlock request (blocked by task thread 3)|
|T4| | |replicate -> UploadBlockSync (blocked by shuffle server thread 2)| |
|T5| |Received UploadBlock request (blocked by task thread 1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|
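
The interleaving in the table can be reproduced with a small toy model. This is a sketch only, not Spark's BlockManager code: the `Executor` class, `handle_upload_block`, `run_task`, and `simulate` are hypothetical names, and the `timeout` is an assumption added so the example terminates instead of hanging the way the real deadlock would. Each task holds its local write lock while it waits for the peer to accept the replica, and the peer needs its own write lock on the same block to do so.

```python
import threading


class Executor:
    """Toy stand-in for an executor holding one RDD block (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.block_lock = threading.Lock()  # models the block's write lock

    def handle_upload_block(self, timeout):
        # Models the UploadBlock handler: it needs the local write lock
        # before it can store the replica. For simplicity it runs in the
        # caller's thread; in the report it is a separate shuffle server
        # thread that the caller waits on synchronously.
        acquired = self.block_lock.acquire(timeout=timeout)
        if acquired:
            self.block_lock.release()
        return acquired


def run_task(local, peer, barrier, results, timeout):
    with local.block_lock:          # T0 / T1: take the write lock
        barrier.wait()              # make sure both tasks hold their locks
        # T2 / T4: synchronous replication to the only other executor
        results[local.name] = peer.handle_upload_block(timeout)
        barrier.wait()              # keep the lock until both attempts end


def simulate(timeout=0.5):
    exe1, exe2 = Executor("exe1"), Executor("exe2")
    barrier = threading.Barrier(2)
    results = {}
    threads = [
        threading.Thread(target=run_task,
                         args=(exe1, exe2, barrier, results, timeout)),
        threading.Thread(target=run_task,
                         args=(exe2, exe1, barrier, results, timeout)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # False means the replica upload could not get the peer's write lock.
    print(simulate())
```

Both uploads fail to obtain the peer's lock within the timeout. Without the timeout, each side would wait on the other indefinitely, which corresponds to row T6.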