Sandy,

 

Do you mean the “preferred location” feature also works for a standalone cluster?
I ask because I checked the code of SparkContext and saw the comments below:

 

  // This is used only by YARN for now, but should be relevant to other cluster types (Mesos,
  // etc) too. This is typically generated from InputFormatInfo.computePreferredLocations. It
  // contains a map from hostname to a list of input format splits on the host.
  private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()

 

BTW, even with the preferred hosts, how does Spark decide how many total 
executors to use for this application?
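
My rough understanding is that, even with preferred hosts, the total executor count is fixed purely by configuration rather than derived from the input data. A minimal sketch of what I mean (master URL, app name, and numbers below are placeholders, not taken from our setup):

import org.apache.spark.{SparkConf, SparkContext}

// Standalone mode: by default an application takes every core the cluster
// offers; spark.cores.max caps the total cores (and therefore executors)
// handed to the application.
val conf = new SparkConf()
  .setAppName("locality-test")          // placeholder app name
  .setMaster("spark://master:7077")     // placeholder master URL
  .set("spark.cores.max", "8")

val sc = new SparkContext(conf)

// On YARN the count is requested explicitly instead, e.g.
//   spark-submit --num-executors 8 ...

Is that the right way to think about it, or do the preferred locations also influence the count?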

 

Thanks again!

 

________________________________

From: Sandy Ryza [mailto:sandy.r...@cloudera.com] 
Sent: Friday, July 18, 2014 3:44 PM
To: user@spark.apache.org
Subject: Re: data locality

 

Hi Haopu,

 

Spark will ask HDFS for file block locations and try to assign tasks based on 
these.
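
To make that concrete, the metadata consulted is just HDFS's per-block replica hosts. Here is a small standalone sketch of that lookup using the plain Hadoop FileSystem API ("myfile.txt" is a placeholder path, and this is only an illustration of the information involved, not Spark's internal code):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
val status = fs.getFileStatus(new Path("myfile.txt"))

// Each block reports the datanode hostnames holding a replica; Spark prefers
// to run the task for the corresponding split on one of those hosts.
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach { b =>
  println(s"offset=${b.getOffset} hosts=${b.getHosts.mkString(",")}")
}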

 

There is a snag.  Spark schedules its tasks inside of "executor" processes that 
stick around for the lifetime of a Spark application.  Spark requests executors 
before it runs any jobs, i.e. before it has any information about where the 
input data for the jobs is located.  If the executors occupy significantly 
fewer nodes than exist in the cluster, it can be difficult for Spark to achieve 
data locality.  The workaround for this is an API that allows passing in a set 
of preferred locations when instantiating a Spark context.  This API is 
currently broken in Spark 1.0, and will likely be changed to something a little 
simpler in a future release.

 

val locData = InputFormatInfo.computePreferredLocations(
  Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))

val sc = new SparkContext(conf, locData)

 

-Sandy

 

 

On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang <hw...@qilinsoft.com> wrote:

I have a standalone Spark cluster and an HDFS cluster which share some of the nodes.

 

When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask HDFS 
for the location of each file block in order to pick the right worker node?

 

How about a Spark cluster on YARN?

 

Thank you very much!

 

 
