You can view the Locality Level of each task within a stage in the
Spark Web UI, under the Stages tab.

The levels are as follows (in order of decreasing desirability):
1) PROCESS_LOCAL <- data is in the same executor JVM as the task
2) NODE_LOCAL <- data is on the same node as the executor JVM
3) RACK_LOCAL <- data is on a different node in the same rack
4) ANY <- data is outside the rack

Also, the Aggregated Metrics by Executor section of the Stage detail view
shows how much data is being shuffled across the network (Shuffle
Read/Write).  Ideally you want that metric to be 0.
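
If you'd rather check this programmatically than eyeball the UI, a rough
(untested) sketch with a SparkListener looks something like this -- the
class name here is just illustrative:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  // prints the locality level of every finished task; register it with
  // sc.addSparkListener(new LocalityLogger) before running your job
  class LocalityLogger extends SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val info = taskEnd.taskInfo
      println(s"task ${info.taskId} finished at locality level ${info.taskLocality}")
    }
  }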

-chris


On Fri, Jul 25, 2014 at 4:13 AM, Tsai Li Ming <mailingl...@ltsai.com> wrote:

> Hi,
>
> In the standalone mode, how can we check data locality is working as
> expected when tasks are assigned?
>
> Thanks!
>
>
> On 23 Jul, 2014, at 12:49 am, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
> On standalone there is still special handling for assigning tasks within
> executors.  There just isn't special handling for where to place executors,
> because standalone generally places an executor on every node.
>
>
> On Mon, Jul 21, 2014 at 7:42 PM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
>>   Sandy,
>>
>>
>>
>> I just tried the standalone cluster and haven't had a chance to try YARN
>> yet.
>>
>> So if I understand correctly, there is **no** special handling of task
>> assignment according to the HDFS block's location when Spark is running as
>> a **standalone** cluster.
>>
>> Please correct me if I'm wrong. Thank you for your patience!
>>
>>
>>  ------------------------------
>>
>> *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com]
>> *Sent:* July 22, 2014 9:47
>>
>> *To:* user@spark.apache.org
>> *Subject:* Re: data locality
>>
>>
>>
>> This currently only works for YARN.  The standalone default is to place
>> an executor on every node for every job.
>>
>>
>>
>> The total number of executors is specified by the user.
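>>
>> (For illustration only: on YARN this is typically passed at submit time,
>> e.g. something along the lines of
>>
>>   spark-submit --master yarn-client --num-executors 8 \
>>     --executor-cores 2 --executor-memory 4g my-app.jar
>>
>> where the numbers and jar name are placeholders.)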
>>
>>
>>
>> -Sandy
>>
>>
>>
>> On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>>
>> Sandy,
>>
>>
>>
>> Do you mean the “preferred location” works for a standalone cluster as
>> well? I checked the code of SparkContext and saw the comments below:
>>
>>
>>
>>   // This is used only by YARN for now, but should be relevant to other
>>   // cluster types (Mesos, etc.) too. This is typically generated from
>>   // InputFormatInfo.computePreferredLocations. It contains a map from
>>   // hostname to a list of input format splits on the host.
>>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()
>>
>>
>>
>> BTW, even with the preferred hosts, how does Spark decide how many total
>> executors to use for this application?
>>
>>
>>
>> Thanks again!
>>
>>
>>  ------------------------------
>>
>> *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com]
>> *Sent:* Friday, July 18, 2014 3:44 PM
>> *To:* user@spark.apache.org
>> *Subject:* Re: data locality
>>
>>
>>
>> Hi Haopu,
>>
>>
>>
>> Spark will ask HDFS for file block locations and try to assign tasks
>> based on these.
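>>
>> (As a quick, untested illustration, you can see the block hosts Spark
>> picks up for an input RDD with something like the snippet below; the
>> path is a placeholder.)
>>
>>   val rdd = sc.textFile("hdfs:///some/path/myfile.txt")
>>   rdd.partitions.foreach { p =>
>>     println("partition " + p.index + " -> " +
>>       rdd.preferredLocations(p).mkString(", "))
>>   }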
>>
>>
>>
>> There is a snag.  Spark schedules its tasks inside of "executor"
>> processes that stick around for the lifetime of a Spark application.  Spark
>> requests executors before it runs any jobs, i.e. before it has any
>> information about where the input data for the jobs is located.  If the
>> executors occupy significantly fewer nodes than exist in the cluster, it
>> can be difficult for Spark to achieve data locality.  The workaround for
>> this is an API that allows passing in a set of preferred locations when
>> instantiating a Spark context.  This API is currently broken in Spark 1.0,
>> and will likely change to something a little simpler in a future
>> release.
>>
>>
>>
>> val locData = InputFormatInfo.computePreferredLocations(
>>   Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
>>
>> val sc = new SparkContext(conf, locData)
>>
>>
>>
>> -Sandy
>>
>>
>>
>>
>>
>> On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>>
>> I have a standalone Spark cluster and an HDFS cluster which share some of
>> the nodes.
>>
>>
>>
>> When reading an HDFS file, how does Spark assign tasks to nodes? Will it
>> ask HDFS for the location of each file block in order to pick the right worker node?
>>
>>
>>
>> How about a spark cluster on Yarn?
>>
>>
>>
>> Thank you very much!
>>
>>
>>
>>
>>
>>
>>
>
>
>
