[ https://issues.apache.org/jira/browse/SPARK-45057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhongwei Zhu updated SPARK-45057:
---------------------------------
    Description: 
When two tasks try to compute the same RDD with a replication level of 2 while running on only two executors, a deadlock occurs. A task releases its write lock only after writing the block to the local machine and replicating it to the remote executor.

||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by T4)| | | |
|T3| | | |Received UploadBlock request from T1 (blocked by T3)|
|T4| | |replicate -> UploadBlockSync (blocked by T2)| |
|T5| |Received UploadBlock request from T3 (blocked by T1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|

  was:
When two tasks try to compute the same RDD with a replication level of 2 while running on only two executors, a deadlock occurs.

||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
|T0|write lock of rdd| | | |
|T1| | |write lock of rdd| |
|T2|replicate -> UploadBlockSync (blocked by T4)| | | |
|T3| | | |Received UploadBlock request from T1 (blocked by T3)|
|T4| | |replicate -> UploadBlockSync (blocked by T2)| |
|T5| |Received UploadBlock request from T3 (blocked by T1)| | |
|T6|Deadlock|Deadlock|Deadlock|Deadlock|

> Deadlock caused by rdd replication level of 2
> ---------------------------------------------
>
>                 Key: SPARK-45057
>                 URL: https://issues.apache.org/jira/browse/SPARK-45057
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Zhongwei Zhu
>            Priority: Major
>
> When two tasks try to compute the same RDD with a replication level of 2
> while running on only two executors, a deadlock occurs.
> A task releases its write lock only after writing the block to the local
> machine and replicating it to the remote executor.
>
> ||Time||Exe 1 (Task Thread T1)||Exe 1 (Shuffle Server Thread T2)||Exe 2 (Task Thread T3)||Exe 2 (Shuffle Server Thread T4)||
> |T0|write lock of rdd| | | |
> |T1| | |write lock of rdd| |
> |T2|replicate -> UploadBlockSync (blocked by T4)| | | |
> |T3| | | |Received UploadBlock request from T1 (blocked by T3)|
> |T4| | |replicate -> UploadBlockSync (blocked by T2)| |
> |T5| |Received UploadBlock request from T3 (blocked by T1)| | |
> |T6|Deadlock|Deadlock|Deadlock|Deadlock|
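
The lock cycle in the table can be modelled outside Spark with a few lines of Python. This is only a sketch under stated assumptions: `Executor`, `handle_upload`, `run_task` and `simulate` are hypothetical names, not Spark APIs, and the synchronous upload is modelled as a direct call that blocks the calling task thread. Lock acquisition uses a timeout so that the run reports the cycle instead of hanging.

```python
import threading


class Executor:
    """Toy model of one executor (hypothetical; not Spark's BlockManager)."""

    def __init__(self, name):
        self.name = name
        self.block_lock = threading.Lock()  # stands in for the block's write lock
        self.peer = None

    def handle_upload(self, timeout):
        # Role of the server thread: storing the replica needs the local
        # write lock for the same block. The timeout replaces waiting forever.
        if not self.block_lock.acquire(timeout=timeout):
            return False
        self.block_lock.release()
        return True

    def run_task(self, barrier, timeout, results, hold_lock_while_replicating):
        self.block_lock.acquire()   # T0/T1: task takes the local write lock
        barrier.wait()              # both tasks now hold their own lock
        if not hold_lock_while_replicating:
            self.block_lock.release()
        # T2/T4: synchronous upload, modelled as a blocking call into the peer
        results[self.name] = self.peer.handle_upload(timeout)
        barrier.wait()              # wait until both upload attempts returned
        if hold_lock_while_replicating:
            self.block_lock.release()


def simulate(hold_lock_while_replicating, timeout=0.5):
    """Run one task per executor; return whether each upload was accepted."""
    exe1, exe2 = Executor("exe1"), Executor("exe2")
    exe1.peer, exe2.peer = exe2, exe1
    barrier = threading.Barrier(2)
    results = {}
    threads = [
        threading.Thread(
            target=exe.run_task,
            args=(barrier, timeout, results, hold_lock_while_replicating),
        )
        for exe in (exe1, exe2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # A False value means that upload timed out waiting for the peer's lock.
    print(simulate(hold_lock_while_replicating=True))
    print(simulate(hold_lock_while_replicating=False))
```

With `hold_lock_while_replicating=True` each upload waits on a lock that the other task will not release until its own upload finishes, which is the cycle shown at T2 through T5. The `False` variant is included only to show that the cycle depends on holding the lock across the synchronous upload; it is not a proposed fix for the issue.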