[ 
https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178980#comment-15178980
 ] 

Andy Sloane commented on SPARK-13631:
-------------------------------------

Hm. I see: it's just a coincidence of timing. A thread from the thread pool 
tends to get scheduled while the shuffle task is running, and it tries to 
schedule a new job based on the locations of the (currently in-progress) 
shuffle tasks. No blocking wait on an RDD is causing this job to kick off; 
the thread is just trying to schedule it.

It seems like we'd want to somehow defer finding preferred locations for an 
RDD that is currently being computed, but job planning seems to happen 
completely up front, and nothing in an individual RDD indicates that it is 
presently being computed. That also means we are probably recomputing RDDs 
unnecessarily in this multithreaded scheme.
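
A minimal sketch of the failure mode described above, under one loud 
assumption: that the tracker keeps one slot per map output and a slot stays 
null until its map task has finished. Status, PreferredLocsSketch, 
naiveBytesByHost and safeBytesByHost are hypothetical names for illustration; 
this is not Spark's actual code or the actual fix.

```scala
// Hypothetical stand-in for a map output status: the host a map task ran on
// and the bytes it wrote for each reduce partition.
final case class Status(host: String, sizes: Array[Long])

object PreferredLocsSketch {
  // Naive aggregation: assumes every slot is populated, so a slot that is
  // still null (map task not finished) throws a NullPointerException.
  def naiveBytesByHost(statuses: Array[Status], reducerId: Int): Map[String, Long] =
    statuses.foldLeft(Map.empty[String, Long]) { (acc, s) =>
      acc.updated(s.host, acc.getOrElse(s.host, 0L) + s.sizes(reducerId))
    }

  // Defensive aggregation: hold the array's lock while reading and skip
  // slots that are still null. The lock only helps if writers take it too.
  def safeBytesByHost(statuses: Array[Status], reducerId: Int): Map[String, Long] =
    statuses.synchronized {
      statuses.foldLeft(Map.empty[String, Long]) { (acc, s) =>
        if (s == null) acc
        else acc.updated(s.host, acc.getOrElse(s.host, 0L) + s.sizes(reducerId))
      }
    }

  def main(args: Array[String]): Unit = {
    // Two map outputs are expected, but only the first has been registered.
    val statuses = new Array[Status](2)
    statuses(0) = Status("host-a", Array(10L, 20L))

    // The naive version fails on the null slot.
    println(scala.util.Try(naiveBytesByHost(statuses, 0)).isFailure)
    // The defensive version reports only what is known so far.
    println(safeBytesByHost(statuses, 0))
  }
}
```

Skipping unfinished outputs only makes the location lookup safe. It does not 
address the up-front planning concern, since the scheduler would still be 
choosing locations from partial information.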



> getPreferredLocations race condition in spark 1.6.0?
> ----------------------------------------------------
>
>                 Key: SPARK-13631
>                 URL: https://issues.apache.org/jira/browse/SPARK-13631
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.6.0
>            Reporter: Andy Sloane
>
> We are seeing something that looks a lot like a regression from Spark 1.2. 
> When we run jobs with multiple threads, we get a crash somewhere inside 
> getPreferredLocations, like the one fixed in SPARK-4454, except that now it 
> is inside 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs 
> instead of in DAGScheduler directly.
> I tried Spark 1.2 post-SPARK-4454 (before that patch it's only slightly 
> flaky), 1.4.1, and 1.5.2, and all are fine. 1.6.0 crashes immediately on our 
> threaded test case, though once in a while it passes.
> The stack trace is huge, but starts like this:
> Caused by: java.lang.NullPointerException: null
>       at org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406)
>       at org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366)
>       at org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92)
>       at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>       at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>       at scala.Option.getOrElse(Option.scala:120)
>       at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
>       at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)
> The full trace is available here:
> https://gist.github.com/andy256/97611f19924bbf65cf49



