[ https://issues.apache.org/jira/browse/SPARK-28917 ]

Marcelo Masiero Vanzin resolved SPARK-28917.
--------------------------------------------
    Fix Version/s: 3.0.0
                   2.4.5
       Resolution: Fixed

Issue resolved by pull request 25951
[https://github.com/apache/spark/pull/25951]

> Jobs can hang because of race of RDD.dependencies
> -------------------------------------------------
>
>                 Key: SPARK-28917
>                 URL: https://issues.apache.org/jira/browse/SPARK-28917
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.3.3, 2.4.3
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>            Priority: Major
>             Fix For: 2.4.5, 3.0.0
>
>
> {{RDD.dependencies}} caches its computed value, but the caching is not
> thread-safe. This can lead to a race where the cached value gets
> overwritten and the DAGScheduler gets stuck in an inconsistent state. In
> particular, this can happen when the DAGScheduler event loop races with
> another thread (e.g. a user thread, if there is multi-threaded job
> submission).
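>
> As a rough sketch of the pattern at fault (simplified for illustration;
> {{SketchRDD}} and {{Dep}} are placeholders, not Spark's actual classes),
> the cache is a plain field guarded only by a check-then-set:
> {code:scala}
> // Minimal sketch of the unsynchronized caching described above; a
> // simplified stand-in, not Spark's real RDD source.
> class Dep // placeholder for org.apache.spark.Dependency[_]
>
> abstract class SketchRDD {
>   // The cached value; no lock (and no volatile) protects this field.
>   private var deps: Seq[Dep] = null
>
>   // Subclasses may build *new* Dep objects on every call (e.g. a shuffle
>   // dependency with a fresh shuffle id).
>   protected def computeDeps(): Seq[Dep]
>
>   final def dependencies: Seq[Dep] = {
>     if (deps == null) {    // two threads can both observe null here...
>       deps = computeDeps() // ...both compute, and the later write wins
>     }
>     deps                   // racing callers may hold *different* Dep objects
>   }
> }
> {code}
>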
> First, a job is submitted by the user, which then computes the result Stage
> and its parents:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983
> This eventually makes a call to {{rdd.dependencies}}:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519
> At the same time, the user could also touch {{rdd.dependencies}} in another
> thread, which could overwrite the stored value because of the race.
>
> Then the DAGScheduler checks the dependencies *again* later on in the job
> submission, via {{getMissingParentStages}}:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025
> Because it will find new dependencies, it will create entirely different
> stages. Now the job has some orphaned stages which will never run.
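>
> For illustration, an interleaving along these lines could trigger the race
> (a timing-dependent sketch, not a deterministic reproduction; it assumes an
> existing SparkContext {{sc}}):
> {code:scala}
> // One thread submits a job, so the DAGScheduler event loop calls
> // rdd.dependencies while building stages; another user thread touches
> // rdd.dependencies directly at the same time.
> val rdd = sc.parallelize(1 to 1000, 10).repartition(40) // lazy shuffle dependency
>
> val submitter = new Thread(() => rdd.count())     // scheduler reads the cache
> val toucher = new Thread(() => rdd.dependencies)  // user thread may overwrite it
> submitter.start(); toucher.start()
> submitter.join(); toucher.join()
> {code}
>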
> One symptom of this is seeing disjoint sets of stages in the "Parents of
> final stage" and the "Missing parents" messages on job submission (however
> this is not required).
>
> (*EDIT*: Seeing repeated msgs "Registering RDD X" is actually just fine; it
> is not a symptom of a problem at all. It just means the RDD is the *input*
> to multiple shuffles.)
> {noformat}
> [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - Starting job: count at XXX.scala:462
> ...
> [INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> ...
> ...
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 5 (count at XXX.scala:462)
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Parents of final stage: List(ShuffleMapStage 4)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo - Missing parents: List(ShuffleMapStage 6)
> {noformat}
> Another symptom is only visible with DEBUG logs turned on for the
> DAGScheduler -- you will see calls to {{submitStage(Stage X)}} multiple
> times, followed by a different set of missing stages. E.g. here, stage 1 is
> first missing stage 0 as a dependency, and then later on it is missing
> stage 23:
> {noformat}
> 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1)
> 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage 0)
> ...
> 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1)
> 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage 23)
> {noformat}
> Note that there is a similar issue with {{rdd.partitions}}. In particular,
> for some RDDs, {{partitions}} references {{dependencies}} (e.g.
> {{CoGroupedRDD}}).
>
> There is also an issue that {{rdd.storageLevel}} is read and cached in the
> scheduler, but it could be modified simultaneously by the user in another
> thread. However, I can't see a way it could affect the scheduler.
>
> *WORKAROUND*:
> (a) call {{rdd.dependencies}} while you know the RDD is only being touched
> by one thread (e.g. in the thread that created it, or before you submit
> multiple jobs touching that RDD from other threads). Then that value will
> get cached (see the sketch after this list).
> (b) don't submit jobs from multiple threads.
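>
> A sketch of workaround (a), again assuming an existing SparkContext {{sc}}:
> {code:scala}
> // Force the dependency computation once, from a single thread, so the
> // cached value is stable before any concurrent jobs touch the RDD.
> val rdd = sc.parallelize(1 to 1000, 10).repartition(40)
> rdd.dependencies // computed and cached here, while one thread holds the RDD
>
> // Only now submit jobs against rdd from multiple threads.
> val t1 = new Thread(() => rdd.count())
> val t2 = new Thread(() => rdd.count())
> t1.start(); t2.start()
> t1.join(); t2.join()
> {code}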