Zhongwei Zhu created SPARK-45057:
------------------------------------

             Summary: Deadlock caused by rdd replication level of 2
                 Key: SPARK-45057
                 URL: https://issues.apache.org/jira/browse/SPARK-45057
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.4.1
            Reporter: Zhongwei Zhu

When two tasks try to compute the same RDD with a replication level of 2 while running on only two executors, a deadlock occurs. Each task thread takes the write lock on the RDD block and then replicates it synchronously to the other executor, where the shuffle server thread handling the upload is blocked by the local task thread that still holds the lock.

||Time||Exe 1 (Task Thread 1)||Exe 1 (Shuffle Server Thread 2)||Exe 2 (Task Thread 3)||Exe 2 (Shuffle Server Thread 4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by shuffle server thread 4)| | | |
|T3| | | |Received UploadBlock request (blocked by task thread 3)|
|T4| | |replicate -> UploadBlockSync (blocked by shuffle server thread 2)| |
|T5| |Received UploadBlock request (blocked by task thread 1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|
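
The circular wait in the timeline can be reproduced outside Spark with a small threading model. This is only a sketch under stated assumptions, not Spark code: `Executor`, `handle_upload`, `run_task`, and `simulate` are hypothetical names, a plain `threading.Lock` stands in for the block's write lock, and the upload uses a timeout purely so the demonstration terminates instead of hanging.

```python
import threading


class Executor:
    """Hypothetical stand-in for one executor holding a single RDD block."""

    def __init__(self, name):
        self.name = name
        self.block_lock = threading.Lock()  # models the block's write lock
        self.peer = None

    def handle_upload(self, timeout):
        # Shuffle server thread: needs the local write lock to accept the replica.
        acquired = self.block_lock.acquire(timeout=timeout)
        if acquired:
            self.block_lock.release()
        return acquired

    def run_task(self, locked, done, timeout, results):
        with self.block_lock:              # T0 / T1: take the write lock
            locked.wait()                  # both tasks now hold their own lock
            outcome = []
            server = threading.Thread(
                target=lambda: outcome.append(self.peer.handle_upload(timeout)))
            server.start()
            server.join()                  # T2 / T4: synchronous upload, lock still held
            results[self.name] = outcome[0]
            done.wait()                    # demo only: hold the lock until both uploads return


def simulate(timeout=0.2):
    exe1, exe2 = Executor("exe1"), Executor("exe2")
    exe1.peer, exe2.peer = exe2, exe1
    locked, done = threading.Barrier(2), threading.Barrier(2)
    results = {}
    tasks = [threading.Thread(target=e.run_task,
                              args=(locked, done, timeout, results))
             for e in (exe1, exe2)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return results  # maps executor name to True if its replication succeeded


if __name__ == "__main__":
    print(simulate())
```

In this model neither upload can acquire the peer's lock, so both entries in the result are `False`. Without the timeout, each upload would wait indefinitely on a lock whose holder is itself waiting, which corresponds to T6 in the table.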