Data locality is part of the job/task scheduler's responsibility. So both links 
you originally specified are correct: one is for the standalone mode that comes 
with Spark, the other is for YARN. Both have this ability.


But YARN, as a very popular scheduling component, comes with many, many more 
features than the standalone mode. You can research it further on Google.
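For what it's worth, task-level locality works the same way under either 
cluster manager, and how long the scheduler waits for a data-local slot is 
tunable. A minimal sketch of a spark-defaults.conf fragment (the setting names 
are real Spark configs; the values shown are the defaults, for illustration 
only):

```properties
# How long the scheduler waits for a data-local slot before falling back
# to a less-local level (node-local -> rack-local -> any). Applies to
# standalone mode and YARN alike.
spark.locality.wait            3s

# Optional per-level overrides; each defaults to spark.locality.wait.
spark.locality.wait.process    3s
spark.locality.wait.node       3s
spark.locality.wait.rack       3s
```

Raising these waits favors locality over scheduling latency; setting them to 0 
disables the wait entirely.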


Yong


________________________________
From: Jestin Ma <jestinwith.a...@gmail.com>
Sent: Tuesday, August 2, 2016 7:11 PM
To: Jacek Laskowski
Cc: Nikolay Zhebet; Andrew Ehrlich; user
Subject: Re: Tuning level of Parallelism: Increase or decrease?

Hi Jacek,
I found this page of your book here: 
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html


which says: "It is therefore important to have Spark running on Hadoop YARN 
cluster<https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/> 
if the data comes from HDFS. In Spark on YARN Spark tries to place tasks 
alongside HDFS blocks."


So my reasoning was: since Spark itself takes care of data locality when 
workers load data from HDFS, I can't see why running on YARN would be more 
important for locality.

Hope this makes my question clearer.


On Tue, Aug 2, 2016 at 3:54 PM, Jacek Laskowski <ja...@japila.pl> wrote:
On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma <jestinwith.a...@gmail.com> wrote:
> Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
> conflicting sources on using YARN for Spark.
>
> Reynold said that Spark workers automatically take care of data locality
> here:
> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>
> However, I've read elsewhere
> (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
> that Spark on YARN increases data locality because YARN tries to place tasks
> next to HDFS blocks.
>
> Can anyone verify/support one side or the other?

Hi Jestin,

I'm the author of the latter. I can't see how Reynold's answer "conflicts"
with what I wrote in the notes. Could you elaborate?

I certainly may be wrong.

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
