as Mayur indicated, it's odd that you are seeing better performance from a
less-local configuration.  however, the non-deterministic behavior that you
describe is likely caused by GC pauses in your JVM process.

take note of the *spark.locality.wait* configuration parameter described
here: http://spark.apache.org/docs/latest/configuration.html

this is the amount of time Spark waits at a given locality level before
falling back and launching the task at the next, less-data-local level (i.e.
process -> node -> rack).  by default, this is 3 seconds.
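
for reference, the wait can be set globally or per locality level in
spark-defaults.conf (or via --conf on spark-submit).  the values below just
restate the 3-second default:

```
# spark-defaults.conf; 3000 ms is the default for all four keys
spark.locality.wait          3000
spark.locality.wait.process  3000
spark.locality.wait.node     3000
spark.locality.wait.rack     3000
```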

if there is excessive GC occurring on the original process-local JVM, it is
possible that another node-local JVM process could actually load the data
from HDFS (on the same node) and complete the processing before the
original process's GC finishes.

you could bump up the *spark.locality.wait* default (not recommended) or
increase your number of nodes/partitions to increase parallelism and reduce
hotspots.
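
for example, parallelism can be raised per-RDD (sc.textFile takes an optional
second argument requesting a minimum number of input partitions, and
RDD.repartition(n) reshuffles an existing RDD) or globally for shuffles via
spark.default.parallelism.  the value below is illustrative; a common rule of
thumb is 2-3 tasks per CPU core in the cluster:

```
# spark-defaults.conf; value is illustrative -- size it to your cluster's cores
spark.default.parallelism  256
```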

also, keep an eye on your GC characteristics.  perhaps you need to increase
your Eden size to reduce premature promotion through the GC generations and
cut down on major collections.  (the usual GC tuning fun.)
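
as a starting point for that, GC logging and an explicit young-generation
size can be passed through the executor JVM options.  these are standard
HotSpot flags; the 1g size below is purely illustrative:

```
# spark-defaults.conf; -Xmn sizes the young generation (Eden + survivor spaces)
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xmn1g
```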

curious if others have experienced this behavior, as well?

-chris


On Fri, May 2, 2014 at 6:07 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> Spark would be much faster on process_local instead of node_local.
> Node_local references data from local harddisk, process_local references
> data from in-memory thread.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Tue, Apr 22, 2014 at 4:45 PM, Joe L <selme...@yahoo.com> wrote:
>
>> I got the following performance; is it normal for spark to behave like
>> this?  sometimes spark switches from process_local into node_local mode and
>> it becomes 10x faster.  I am very confused.
>>
>> scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000")
>> scala> a.count()
>>
>> Long = 137805557
>> took 130.809661618 s
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/help-me-tp4598.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
