[ https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178710#comment-15178710 ]

Andy Sloane commented on SPARK-13631:
-------------------------------------

Did a little more testing, added some logs, and confirmed my latest hypothesis:

{code}
16/03/03 11:26:23 INFO MapOutputTrackerMaster: Registering shuffle 18 with 1 empty maps (registerShuffle)
16/03/03 11:26:23 INFO MapOutputTrackerMaster: getting locations for shuffle 18 reducer 0/1 (getLocationsWithLargestOutputs)
16/03/03 11:26:23 INFO MapOutputTrackerMaster: Registering shuffle 18 with 1 map outputs (registerMapOutputs)
16/03/03 11:26:23 INFO MapOutputTrackerMaster: getting locations for shuffle 18 reducer 0/1 (getLocationsWithLargestOutputs)
{code}

As I suspected, the call to {{getLocationsWithLargestOutputs}} in one thread is 
getting interleaved between {{registerShuffle}} and {{registerMapOutputs}} in 
another thread. I think there are two jobs that both depend on the output of a 
shuffle stage, and each is waiting for the stage to be marked finished.
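To make the interleaving concrete, here is a minimal standalone sketch (not 
Spark's actual code) of the two-step registration pattern: the shuffle becomes 
visible before its map outputs are filled in, so a reader that runs between 
the two steps dereferences a null slot. {{MapStatus}} and the method names 
here are simplified stand-ins.

{code}
// Toy model of the race: registration happens in two steps, and a reader
// interleaved between them sees an array slot that is still null.
import java.util.concurrent.ConcurrentHashMap

object ShuffleRaceSketch {
  // Simplified stand-in for Spark's MapStatus.
  final case class MapStatus(location: String, bytes: Long)

  private val statuses = new ConcurrentHashMap[Int, Array[MapStatus]]()

  // Step 1: the shuffle becomes visible with an all-null ("empty") array.
  def registerShuffle(shuffleId: Int, numMaps: Int): Unit =
    statuses.put(shuffleId, new Array[MapStatus](numMaps))

  // Step 2: the placeholder array is replaced with the real map outputs.
  def registerMapOutputs(shuffleId: Int, outputs: Array[MapStatus]): Unit =
    statuses.put(shuffleId, outputs)

  // Throws NullPointerException if called between steps 1 and 2.
  def locationsWithLargestOutputs(shuffleId: Int): Seq[String] =
    statuses.get(shuffleId).map(_.location).toSeq

  def main(args: Array[String]): Unit = {
    registerShuffle(18, numMaps = 1)
    val reader = new Thread(new Runnable {
      def run(): Unit = locationsWithLargestOutputs(18) // may NPE here
    })
    reader.start()
    registerMapOutputs(18, Array(MapStatus("host-a", 1024L)))
    reader.join()
  }
}
{code}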

So there may be a better fix in addition to the above: prevent the first task 
from being scheduled before the map outputs are registered. I noticed that 
DAGScheduler calls {{markStageAsFinished}} before calling 
{{registerMapOutputs}}, but switching the order doesn't seem to help. I'm 
still trying to understand the sequence of events here.
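For what it's worth, in the toy model above the window disappears if the 
statuses array is only published once it is fully populated, or if the reader 
treats null slots as not-yet-available. Purely illustrative, not a patch 
against MapOutputTrackerMaster:

{code}
// Publish-once variant: a reader either misses the shuffle entirely or sees
// complete outputs; there is no window where slots are still null.
def registerShuffleWithOutputs(shuffleId: Int, outputs: Array[MapStatus]): Unit =
  statuses.put(shuffleId, outputs)

// Defensive-reader variant: tolerate the window instead of closing it.
def locationsIfComplete(shuffleId: Int): Option[Seq[String]] =
  Option(statuses.get(shuffleId))
    .filter(_.forall(_ != null))
    .map(_.map(_.location).toSeq)
{code}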

> getPreferredLocations race condition in spark 1.6.0?
> ----------------------------------------------------
>
>                 Key: SPARK-13631
>                 URL: https://issues.apache.org/jira/browse/SPARK-13631
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.6.0
>            Reporter: Andy Sloane
>
> We are seeing something that looks a lot like a regression from spark 1.2. 
> When we run jobs with multiple threads, we have a crash somewhere inside 
> getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs 
> instead of DAGScheduler directly.
> I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly 
> flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our 
> threaded test case, though once in a while it passes.
> The stack trace is huge, but starts like this:
> Caused by: java.lang.NullPointerException: null
>       at org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406)
>       at org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366)
>       at org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92)
>       at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>       at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>       at scala.Option.getOrElse(Option.scala:120)
>       at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
>       at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)
> The full trace is available here:
> https://gist.github.com/andy256/97611f19924bbf65cf49


