You can view the Locality Level of each task within a stage in the Spark Web UI under the Stages tab.

The levels are as follows, in order of decreasing desirability:

1) PROCESS_LOCAL  <- data was found directly in the executor JVM
2) NODE_LOCAL     <- data was found on the same node as the executor JVM
3) RACK_LOCAL     <- data was found in the same rack
4) ANY            <- data was found outside the rack

Also, the Aggregated Metrics by Executor section of the Stage detail view shows how much data is being shuffled across the network (Shuffle Read/Write). Zero is where you want to be with that metric.

-chris
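Besides the Web UI, task locality can also be checked programmatically. Below is a minimal sketch using Spark's SparkListener API; the LocalityLogger class name and the log format are only illustrative. It prints the locality level of every task as it finishes:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Logs the locality level (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY)
    // of each completed task, together with the host it ran on.
    class LocalityLogger extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val info = taskEnd.taskInfo
        println(s"stage ${taskEnd.stageId}, task ${info.taskId} on ${info.host}: ${info.taskLocality}")
      }
    }

    // Register it on the SparkContext before running any jobs:
    // sc.addSparkListener(new LocalityLogger)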
On Fri, Jul 25, 2014 at 4:13 AM, Tsai Li Ming <mailingl...@ltsai.com> wrote:

> Hi,
>
> In standalone mode, how can we check that data locality is working as
> expected when tasks are assigned?
>
> Thanks!
>
>
> On 23 Jul, 2014, at 12:49 am, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
> On standalone there is still special handling for assigning tasks within
> executors. There just isn't special handling for where to place executors,
> because standalone generally places an executor on every node.
>
>
> On Mon, Jul 21, 2014 at 7:42 PM, Haopu Wang <hw...@qilinsoft.com> wrote:
>
>> Sandy,
>>
>> I just tried the standalone cluster and haven't had a chance to try YARN
>> yet.
>>
>> So if I understand correctly, there is *no* special handling of task
>> assignment according to the HDFS blocks' locations when Spark is running
>> as a *standalone* cluster.
>>
>> Please correct me if I'm wrong. Thank you for your patience!
>>
>> ------------------------------
>>
>> *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com]
>> *Sent:* July 22, 2014 9:47
>> *To:* user@spark.apache.org
>> *Subject:* Re: data locality
>>
>> This currently only works for YARN. The standalone default is to place
>> an executor on every node for every job.
>>
>> The total number of executors is specified by the user.
>>
>> -Sandy
>>
>> On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>>
>> Sandy,
>>
>> Do you mean the "preferred location" handling works for a standalone
>> cluster as well? I ask because I checked the code of SparkContext and saw
>> the comments below:
>>
>>   // This is used only by YARN for now, but should be relevant to other
>>   // cluster types (Mesos, etc) too. This is typically generated from
>>   // InputFormatInfo.computePreferredLocations. It contains a map from
>>   // hostname to a list of input format splits on the host.
>>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()
>>
>> BTW, even with the preferred hosts, how does Spark decide how many total
>> executors to use for this application?
>>
>> Thanks again!
>>
>> ------------------------------
>>
>> *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com]
>> *Sent:* Friday, July 18, 2014 3:44 PM
>> *To:* user@spark.apache.org
>> *Subject:* Re: data locality
>>
>> Hi Haopu,
>>
>> Spark will ask HDFS for file block locations and try to assign tasks
>> based on these.
>>
>> There is a snag. Spark schedules its tasks inside of "executor" processes
>> that stick around for the lifetime of a Spark application. Spark requests
>> executors before it runs any jobs, i.e. before it has any information
>> about where the input data for the jobs is located. If the executors
>> occupy significantly fewer nodes than exist in the cluster, it can be
>> difficult for Spark to achieve data locality. The workaround for this is
>> an API that allows passing in a set of preferred locations when
>> instantiating a Spark context. This API is currently broken in Spark 1.0,
>> and will likely change to something a little simpler in a future release.
>>
>>   val locData = InputFormatInfo.computePreferredLocations(
>>     Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
>>
>>   val sc = new SparkContext(conf, locData)
>>
>> -Sandy
>>
>> On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>>
>> I have a standalone Spark cluster and an HDFS cluster which share some
>> nodes.
>>
>> When reading an HDFS file, how does Spark assign tasks to nodes? Will it
>> ask HDFS for the location of each file block in order to pick the right
>> worker node?
>>
>> How about a Spark cluster on YARN?
>>
>> Thank you very much!
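As noted above, Spark asks HDFS for file block locations and exposes them as each RDD partition's preferred locations. A minimal sketch to print them, so you can compare against where tasks actually run; the HDFS path and application name below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalityCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("locality-check"))

        // textFile builds a HadoopRDD whose partitions correspond to HDFS blocks;
        // preferredLocations returns the hosts holding each block's replicas.
        val rdd = sc.textFile("hdfs:///myfile.txt")
        rdd.partitions.foreach { p =>
          println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
        }

        sc.stop()
      }
    }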