Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1486#issuecomment-54776543
  
    Hey @cmccabe - thanks for looking at this. From what I can tell, the current approach has the drawback that if a block has both cached and non-cached locations, the non-cached locations are simply ignored. This could actually regress performance for workloads where having, say, 3 machine-local replicas is better than having 1 cached replica.
    
    To get proper delay scheduling with this - where we fall back from cached copies to non-cached copies - I think it would be best to just use the existing mechanism we have for preferring replicas that are cached in-process in Spark on a specific node. That mechanism corresponds to the PROCESS_LOCAL locality level.
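    For reference, here is a rough sketch of the shape of that existing mechanism (the class and method names are illustrative and may not match Spark's internals exactly): a preferred location that names a specific executor can be scheduled at the PROCESS_LOCAL level, while a plain hostname only qualifies for NODE_LOCAL.

```scala
// Illustrative sketch only - not the exact Spark scheduler internals.
// A task's preferred location can name just a host, or a host plus the
// executor that is expected to hold the data in memory.
case class TaskLocation(host: String, executorId: Option[String] = None)

// The scheduler can treat the two cases at different locality levels:
//   executorId defined -> PROCESS_LOCAL (data is in that executor's memory)
//   executorId empty   -> NODE_LOCAL    (data is on the node, e.g. an HDFS replica)
def localityOf(loc: TaskLocation, myExecutorId: String, myHost: String): String =
  loc.executorId match {
    case Some(id) if id == myExecutorId => "PROCESS_LOCAL"
    case _ if loc.host == myHost        => "NODE_LOCAL"
    case _                              => "ANY"
  }
```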
    
    To get this, I think you can make a relatively surgical change: `HadoopRDD` should return a `TaskLocation` with the `executorId` field populated if there is an in-memory replica available on that node. Then we can change the documentation a bit to explain that this field is somewhat overloaded at the moment to mean two things. You would need to add a lookup to see whether we presently have an executor on that node and, if so, find its ID - a sketch of that follows below.
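    Here is a rough sketch of what that lookup could look like. All names below (`preferredLocationsFor`, `executorIdOnHost`, and the parameters) are hypothetical placeholders rather than existing Spark or Hadoop APIs; the real change would live in `HadoopRDD`'s preferred-location logic.

```scala
// Sketch of the proposed change, using illustrative names only.
case class TaskLocation(host: String, executorId: Option[String] = None)

def preferredLocationsFor(
    allReplicaHosts: Seq[String],               // every host holding a replica of the split
    cachedReplicaHosts: Set[String],            // hosts whose replica is in the HDFS cache
    executorIdOnHost: String => Option[String]  // hypothetical lookup: host -> live executor ID
  ): Seq[TaskLocation] = {
  allReplicaHosts.map { host =>
    if (cachedReplicaHosts.contains(host)) {
      // Cached replica: if we currently run an executor on this host, name it so
      // the scheduler can prefer it at PROCESS_LOCAL.
      TaskLocation(host, executorIdOnHost(host))
    } else {
      // Non-cached replica: keep it as an ordinary node-local location rather
      // than dropping it, so delay scheduling can still fall back to it.
      TaskLocation(host)
    }
  }
}
```

    Keeping the non-cached hosts in the returned list is what preserves the fallback behavior and avoids the regression described above: once the PROCESS_LOCAL wait expires, the scheduler can still place the task NODE_LOCAL on any machine-local replica.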

