[jira] [Commented] (SPARK-28917) Jobs can hang because of race of RDD.dependencies

Imran Rashid (Jira) Thu, 29 Aug 2019 09:42:37 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918764#comment-16918764
 ]


Imran Rashid commented on SPARK-28917:
--------------------------------------

[~markhamstra] [~jiangxb1987] [~tgraves] [~Ngone51] I would appreciate your 
thoughts on this.  I think the bug I've described above is pretty clear.  
What I'm less sure about is whether there is other mutable state in RDD that 
could cause similar problems.

In the case I've seen, I only know for sure that the user is calling 
{{rdd.cache()}} in another thread.  But I can't see how that would lead to the 
symptoms I describe above.  I don't know that they are doing anything in their 
user thread which would touch {{rdd.dependencies}}, but I also don't have full 
visibility into everything they are doing, so this still seems like the best 
explanation to me.

> Jobs can hang because of race of RDD.dependencies
> -------------------------------------------------
>
>                 Key: SPARK-28917
>                 URL: https://issues.apache.org/jira/browse/SPARK-28917
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.3.3, 2.4.3
>            Reporter: Imran Rashid
>            Priority: Major
>
> {{RDD.dependencies}} caches its computed value, but the caching is not 
> thread-safe.  This can lead to a race where the value gets overwritten, 
> leaving the DAGScheduler stuck in an inconsistent state.  In particular, this 
> can happen when there is a race between the DAGScheduler event loop and 
> another thread (e.g. a user thread, if there is multi-threaded job submission).
> First, a job is submitted by the user, which then computes the result Stage 
> and its parents:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983
> Which eventually makes a call to {{rdd.dependencies}}:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519
> At the same time, the user could also touch {{rdd.dependencies}} in another 
> thread, which could overwrite the stored value because of the race.
> Then the DAGScheduler checks the dependencies *again* later on in the job 
> submission, via {{getMissingParentStages}}
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025
> Because it will find new dependencies, it will create entirely different 
> stages.  Now the job has some orphaned stages which will never run.
> The symptoms of this are disjoint sets of stages in the "Parents of 
> final stage" and the "Missing parents" messages on job submission, as well as 
> repeated "Registering RDD X" messages for the same RDD id, e.g.:
> {noformat}
> [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - 
> Starting job: count at XXX.scala:462
> ...
> [INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> ...
> ...
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Final stage: ResultStage 5 (count at XXX.scala:462)
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Parents of final stage: List(ShuffleMapStage 4)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Missing parents: List(ShuffleMapStage 6)
> {noformat}
> Note that there is a similar issue with {{rdd.partitions}}. I don't see a way 
> it could mess up the scheduler (it seems it's only used for 
> {{rdd.partitions.length}}).  There is also an issue that {{rdd.storageLevel}} 
> is read and cached in the scheduler, but it could be modified simultaneously 
> by the user in another thread.  Similarly, I can't see a way it could affect 
> the scheduler.
> *WORKAROUND*:
> (a) call {{rdd.dependencies}} while you know that the RDD is only being touched 
> by one thread (e.g. in the thread that created it, or before you submit 
> multiple jobs touching that RDD from other threads). Then that value will get 
> cached.
> (b) don't submit jobs from multiple threads.
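
A minimal sketch of the unsynchronized caching pattern described in the issue, and of workaround (a). This is not Spark's code: {{FakeRdd}}, {{FakeDependency}} and {{WorkaroundSketch}} are hypothetical stand-ins, and the model assumes that every recomputation yields a new, distinguishable dependency object. The idea is that priming the cached value from a single thread, before any other thread touches the object, leaves later readers with nothing to race on.

```scala
import java.util.concurrent.{Callable, Executors}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a dependency; each instance gets a unique id,
// so two separate computations can be told apart.
final class FakeDependency(val id: Int)

// Hypothetical stand-in for an RDD with a lazily cached, unsynchronized field.
final class FakeRdd {
  private val nextId = new AtomicInteger(0)
  private var cached: Seq[FakeDependency] = null // no lock, not volatile

  // Check-then-act without synchronization: two threads can both see null,
  // both compute, and the later write replaces the earlier one.
  def dependencies: Seq[FakeDependency] = {
    if (cached == null) {
      cached = Seq(new FakeDependency(nextId.incrementAndGet()))
    }
    cached
  }
}

object WorkaroundSketch {
  // Reads the cached value from several threads and collects the ids seen.
  def readFromManyThreads(rdd: FakeRdd, threads: Int): Set[Int] = {
    val pool = Executors.newFixedThreadPool(threads)
    try {
      val tasks = (1 to threads).map { _ =>
        pool.submit(new Callable[Int] {
          def call(): Int = rdd.dependencies.head.id
        })
      }
      tasks.map(_.get()).toSet
    } finally {
      pool.shutdown()
    }
  }

  // Workaround (a): force the cached value from one thread first. Submitting
  // a task to the pool happens-before its execution, so every reader sees
  // the primed value and none of them recomputes it.
  def primedIds(threads: Int): Set[Int] = {
    val rdd = new FakeRdd
    rdd.dependencies // prime while only this thread can touch the object
    readFromManyThreads(rdd, threads)
  }
}
```

With the priming call in place, {{WorkaroundSketch.primedIds(8)}} returns a set containing a single id. Without it, concurrent first reads could in principle observe different ids, which is the model's analogue of the scheduler seeing two different sets of dependencies.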



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

