Re: Tuning level of Parallelism: Increase or decrease?

Jacek Laskowski Wed, 03 Aug 2016 04:32:19 -0700

Hi Jestin,

I need to expand on this in the Spark notes.


Spark can handle data locality itself, but if Spark runs on different
nodes than HDFS, there is always the network between them, which makes
performance worse compared to co-locating the Spark and HDFS nodes.

These are mere details and do not really influence the main question
about the level of parallelism (though they are related).
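
To make the locality point concrete, here is a toy model in Python. It is
only a sketch, not Spark's scheduler code: the function best_locality and
the host names are hypothetical, and only the level names (NODE_LOCAL,
RACK_LOCAL, ANY) are taken from Spark. It assumes a block is described
simply by the hosts that hold its replicas.

```python
# Toy model only: this is NOT Spark's scheduler, just an illustration of
# why separating Spark nodes from HDFS nodes rules out node-local reads.

def best_locality(block_hosts, executor_hosts, rack_of=None):
    """Return the best locality level any executor can offer for one block.

    block_hosts:    hosts holding a replica of the HDFS block
    executor_hosts: hosts where Spark executors are running
    rack_of:        optional mapping host -> rack name (hypothetical)
    """
    rack_of = rack_of or {}

    # An executor on a host that stores a replica can read it locally.
    if set(block_hosts) & set(executor_hosts):
        return "NODE_LOCAL"

    # Otherwise, sharing a rack is the next best option.
    block_racks = {rack_of[h] for h in block_hosts if h in rack_of}
    executor_racks = {rack_of[h] for h in executor_hosts if h in rack_of}
    if block_racks & executor_racks:
        return "RACK_LOCAL"

    # No overlap at all: the data always crosses the network.
    return "ANY"


# Hypothetical host names.
replicas = ["dn1", "dn2", "dn3"]
co_located = best_locality(replicas, ["dn2", "dn4"])
separate = best_locality(replicas, ["spark1", "spark2"])
print(co_located, separate)
```

With executors on the HDFS nodes the best case is NODE_LOCAL; with
executors on separate nodes it can never be better than RACK_LOCAL or ANY,
whatever the scheduler does.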

Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Aug 3, 2016 at 1:11 AM, Jestin Ma <jestinwith.a...@gmail.com> wrote:
> Hi Jacek,
> I found this page of your book here:
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html
>
> which says:  "It is therefore important to have Spark running on Hadoop YARN
> cluster if the data comes from HDFS. In Spark on YARN Spark tries to place
> tasks alongside HDFS blocks."
>
>
> So my reasoning was that since Spark takes care of data locality when
> workers load data from HDFS, I can't see why running on YARN is more
> important.
>
> Hope this makes my question clearer.
>
>
> On Tue, Aug 2, 2016 at 3:54 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>> On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma <jestinwith.a...@gmail.com> wrote:
>> > Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
>> > conflicting sources on using YARN for Spark.
>> >
>> > Reynold said that Spark workers automatically take care of data locality
>> > here:
>> >
>> > https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>> >
>> > However, I've read elsewhere
>> >
>> > (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
>> > that Spark on YARN increases data locality because YARN tries to place tasks
>> > next to HDFS blocks.
>> >
>> > Can anyone verify/support one side or the other?
>>
>> Hi Jestin,
>>
>> I'm the author of the latter. I can't seem to find how Reynold
>> "conflicts" with what I wrote in the notes? Could you elaborate?
>>
>> I certainly may be wrong.
>>
>> Regards,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>
>
