Re: data locality

Sandy Ryza Fri, 18 Jul 2014 00:45:08 -0700

Hi Haopu,

Spark will ask HDFS for file block locations and try to assign tasks based
on these.

There is a snag.  Spark schedules its tasks inside of "executor" processes
that stick around for the lifetime of a Spark application.  Spark requests
executors before it runs any jobs, i.e. before it has any information about
where the input data for the jobs is located.  If the executors occupy
significantly fewer nodes than exist in the cluster, it can be difficult
for Spark to achieve data locality.  The workaround for this is an API that
allows passing in a set of preferred locations when instantiating a Spark
context.  This API is currently broken in Spark 1.0, and will likely be
changed to something a little simpler in a future release.

val locData = InputFormatInfo.computePreferredLocations(
  Seq(new InputFormatInfo(conf, classOf[TextInputFormat],
    new Path("myfile.txt"))))

val sc = new SparkContext(conf, locData)
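
To make the snag concrete, here is a toy sketch in plain Scala.  It is not
Spark's scheduler, and every name in it (LocalitySketch, BlockHosts,
localFraction) is made up for illustration.  It assumes one task per HDFS
block and only asks whether any executor sits on a host holding that block,
so it gives an upper bound on achievable node-locality and ignores executor
capacity and scheduling delays.

```scala
// Illustrative only: a toy model, NOT Spark's actual scheduling logic.
object LocalitySketch {
  // Hypothetical input: for each task (one per HDFS block),
  // the set of hosts that store a replica of that block.
  type BlockHosts = Seq[Set[String]]

  // Fraction of tasks that could run node-local, i.e. tasks for which
  // at least one executor host also holds a replica of the task's block.
  def localFraction(blocks: BlockHosts, executorHosts: Set[String]): Double =
    if (blocks.isEmpty) 0.0
    else blocks.count(hosts => (hosts & executorHosts).nonEmpty).toDouble / blocks.size

  def main(args: Array[String]): Unit = {
    // Six blocks, each stored on exactly one of node1..node6
    // (replication factor 1, purely to keep the example small).
    val blocks: BlockHosts = (1 to 6).map(i => Set(s"node$i"))

    // Executors on only two nodes: just 2 of the 6 tasks can be node-local.
    println(localFraction(blocks, Set("node1", "node2")))

    // Executors on every node: all 6 tasks can be node-local.
    println(localFraction(blocks, (1 to 6).map(i => s"node$i").toSet))
  }
}
```

With executors on two of six nodes, most tasks have to read their block over
the network no matter how the scheduler assigns them, which is why knowing
preferred locations before executors are requested matters.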

-Sandy

On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang <hw...@qilinsoft.com> wrote:

> I have a standalone Spark cluster and an HDFS cluster which share some
> nodes.
>
> When reading an HDFS file, how does Spark assign tasks to nodes? Will it
> ask HDFS for the location of each file block in order to pick the right
> worker node?
>
> How about a Spark cluster on YARN?
>
> Thank you very much!
