Data locality is part of the job/task scheduling responsibility, so both links you cited are correct: one covers the standalone scheduler that ships with Spark, the other covers YARN. Both support locality-aware scheduling.
But YARN, as a very popular scheduling component, comes with MUCH, MUCH more features than standalone mode. You can research it further on Google.

Yong

________________________________
From: Jestin Ma <jestinwith.a...@gmail.com>
Sent: Tuesday, August 2, 2016 7:11 PM
To: Jacek Laskowski
Cc: Nikolay Zhebet; Andrew Ehrlich; user
Subject: Re: Tuning level of Parallelism: Increase or decrease?

Hi Jacek, I found this page of your book:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html

which says: "It is therefore important to have Spark running on a Hadoop YARN cluster (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/) if the data comes from HDFS. In Spark on YARN, Spark tries to place tasks alongside HDFS blocks."

So my reasoning was that since Spark takes care of data locality when workers load data from HDFS, I can't see why running on YARN is more important. I hope this makes my question clearer.

On Tue, Aug 2, 2016 at 3:54 PM, Jacek Laskowski <ja...@japila.pl> wrote:
On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma <jestinwith.a...@gmail.com> wrote:
> Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
> conflicting sources on using YARN for Spark.
> Reynold said that Spark workers automatically take care of data locality
> here:
> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>
> However, I've read elsewhere
> (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
> that Spark on YARN increases data locality because YARN tries to place tasks
> next to HDFS blocks.
>
> Can anyone verify/support one side or the other?

Hi Jestin,

I'm the author of the latter. I can't see how Reynold's answer "conflicts" with what I wrote in the notes. Could you elaborate? I certainly may be wrong.

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
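[Editor's note] For readers following this thread: whichever cluster manager is used, how long the scheduler waits for a data-local slot before falling back to a less-local one is governed by Spark's documented `spark.locality.wait` setting. Below is a minimal sketch of submitting an application on YARN with that knob set explicitly; the class name `com.example.MyApp` and the jar `my-app.jar` are hypothetical placeholders, not anything from this thread.

```shell
# Sketch only: com.example.MyApp and my-app.jar are placeholder names.
# spark.locality.wait is how long the task scheduler waits for a
# data-local executor slot before degrading to the next locality level
# (PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY). 3s is the default.
spark-submit \
  --master yarn \
  --class com.example.MyApp \
  --conf spark.locality.wait=3s \
  my-app.jar
```

Raising the wait can improve locality at the cost of scheduling latency; lowering it trades locality for faster task launch.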