Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/3345#issuecomment-68068199
  
    Ah yes, I see now.  Thanks for coming back to this one, Josh.
    
    `DAGScheduler#getPreferredLocs` is definitely broken.  You're correct that 
the access to, and potential update of, `cacheLocs` needs to be routed 
through the actor.  But because the preferred locations have to be returned 
to the caller, this will be a little different from the fire-and-forget 
messages that are currently sent to the `eventProcessActor`; it will need to 
use the 
[`ask`](http://doc.akka.io/docs/akka/2.3.4/scala/actors.html#Ask__Send-And-Receive-Future) 
pattern instead.
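
    To illustrate the difference, here's a minimal sketch (all names are made 
up, and the request/reply is modeled with a plain `Promise` rather than 
Akka's actual `ask` so the example is self-contained): a fire-and-forget 
message carries no reply channel, while an ask-style message carries 
something the handler must complete with an answer.

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object AskSketch {
  sealed trait SchedulerEvent
  // Fire-and-forget style, like the messages already sent to eventProcessActor.
  case class JobCancelled(jobId: Int) extends SchedulerEvent
  // Ask-style: carries a Promise the handler completes with the reply.
  case class GetPreferredLocs(rddId: Int, partition: Int,
                              reply: Promise[Seq[String]]) extends SchedulerEvent

  // Toy single-threaded handler standing in for the actor's receive loop.
  def handle(event: SchedulerEvent): Unit = event match {
    case JobCancelled(_) => ()  // nothing to send back
    case GetPreferredLocs(rddId, partition, reply) =>
      reply.success(Seq(s"host-$rddId-$partition"))  // made-up locations
  }

  // Caller side: send the request, then block on (or compose with) the Future.
  def preferredLocs(rddId: Int, partition: Int): Seq[String] = {
    val p = Promise[Seq[String]]()
    handle(GetPreferredLocs(rddId, partition, p))
    Await.result(p.future, 1.second)
  }
}
```

    With real Akka the `Promise` plumbing is what `?` (ask) does for you, 
returning a `Future` that the sender can await or map over.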
    
    Something else that concerns me, looking at the usages of 
`SparkContext#getPreferredLocs` in `CoalescedRDD` and 
`PartitionerAwareUnionRDD`, is that both have a `currPrefLocs` method with a 
comment saying it is supposed to "Get the *current* preferred locations from 
the DAGScheduler".  I'm not sure what the expectation or requirement for 
"current" is -- current when the RDD is defined, when actions are run on it, 
or something else?  This feels like a potential race condition to me, and I 
wonder whether it would make sense to resolve the preferred locations as 
lazily as possible, during the execution of a job.  That's just speculation 
about the need for or desirability of that laziness, but I think it deserves 
a look.
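
    A toy sketch of the eager-vs-lazy concern (made-up names, not Spark's 
actual code): a snapshot of cache locations taken at RDD-definition time can 
go stale as blocks get cached, while a lazily resolved lookup sees the state 
at the time the job actually reads it.

```scala
object LazinessSketch {
  // Stand-in for DAGScheduler's cacheLocs, mutated as partitions get cached.
  val cacheLocs = scala.collection.mutable.Map(0 -> Seq.empty[String])

  def snapshotThenUpdate(): (Seq[String], Seq[String]) = {
    val eager = cacheLocs(0)          // resolved at "definition" time
    lazy val deferred = cacheLocs(0)  // resolved only when first read
    cacheLocs(0) = Seq("host-a")      // the partition gets cached later
    (eager, deferred)                 // deferred now sees the update; eager doesn't
  }
}
```

    Whether that staleness actually matters for `CoalescedRDD` and 
`PartitionerAwareUnionRDD` is exactly the question above.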

