Hi Andrew,

I agree with Nicholas.  That was a nice, concise summary of the
meaning of the locality customization options, indicators and default
Spark behaviors.  I haven't combed through the documentation
end-to-end in a while, but I'm also not sure that information is
presently captured anywhere in it, and it would be great to persist it
somewhere besides the mailing list.

best,
-Brad

On Fri, Sep 12, 2014 at 12:12 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> Andrew,
>
> This email was pretty helpful. I feel like this stuff should be summarized
> in the docs somewhere, or perhaps in a blog post.
>
> Do you know if it is?
>
> Nick
>
>
> On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash <and...@andrewash.com> wrote:
>>
>> The locality is how close the data is to the code that's processing it.
>> PROCESS_LOCAL means data is in the same JVM as the code that's running, so
>> it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
>> same node, or in another executor on the same node, so is a little slower
>> because the data has to travel across an IPC connection.  RACK_LOCAL is even
>> slower -- data is on a different server in the same rack, so it needs to be
>> sent over the network.
>>
>> Spark switches to lower locality levels when there's no unprocessed data
>> on a node that has idle CPUs.  In that situation you have two options: wait
>> until the busy CPUs free up so you can start another task that uses data on
>> that server, or start a new task on a farther away server that needs to
>> bring data from that remote place.  What Spark typically does is wait a bit
>> in the hopes that a busy CPU frees up.  Once that timeout expires, it starts
>> moving the data from far away to the free CPU.
>>
>> The main tunable option is how long the scheduler waits before
>> starting to move data rather than code.  Those are the spark.locality.*
>> settings here: http://spark.apache.org/docs/latest/configuration.html
>>
>> If you want to prevent this from happening entirely, you can set the
>> values to ridiculously high numbers.  The documentation also mentions that
>> "0" has special meaning, so you can try that as well.
>>
>> Good luck!
>> Andrew
>>
>>
>> On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung <coded...@cs.stanford.edu>
>> wrote:
>>>
>>> I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd
>>> assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
>>>
>>> When this happens, things get extremely slow.
>>>
>>> Does this mean that the executor got terminated and restarted?
>>>
>>> Is there a way to prevent this from happening (barring the machine
>>> actually going down, I'd rather stick with the same process)?
>>
>>
>
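
The wait-then-fall-back behavior Andrew describes can be sketched in a few
lines of plain Python.  This is a simplified model, not Spark's scheduler
code: the function name allowed_level and the 3000 ms waits are assumptions
made for illustration, and ANY is included as the catch-all level that comes
after RACK_LOCAL.

```python
# Locality levels from closest to farthest, as named in the thread.
# ANY is the catch-all level used once every wait has expired.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def allowed_level(idle_ms, wait_ms):
    """Return the farthest locality level accepted after idle_ms
    milliseconds without launching a task.

    wait_ms maps each level to how long to hold out at that level before
    falling back to the next one.  Hypothetical helper, not a Spark API.
    """
    elapsed = 0
    for level in LEVELS[:-1]:
        elapsed += wait_ms[level]
        if idle_ms < elapsed:
            return level
    return LEVELS[-1]


# Assumed waits of 3000 ms per level, for illustration only.
waits = {"PROCESS_LOCAL": 3000, "NODE_LOCAL": 3000, "RACK_LOCAL": 3000}
for idle in (0, 2999, 3000, 6000, 9000):
    print(idle, allowed_level(idle, waits))

# "Ridiculously high" waits keep tasks at PROCESS_LOCAL indefinitely.
huge = {level: 10**12 for level in LEVELS[:-1]}
print(allowed_level(9000, huge))
```

With the assumed waits, the accepted level steps down one notch each time
another 3000 ms passes without a launch, which is the slowdown Sung Hwan
observed; with very large waits it never leaves PROCESS_LOCAL.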

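For the spark.locality.* settings mentioned in the thread, here is a small
sketch of how one might collect them and pass them to spark-submit.  The
property names are the ones listed on the configuration page; the values are
illustrative assumptions (plain numbers, taken to be milliseconds), and the
helper to_submit_flags is hypothetical.  Check the configuration page for
the units and defaults of your Spark version.

```python
# spark.locality.* keys from the Spark configuration page.
# The values are assumptions for illustration, in milliseconds.
locality_conf = {
    "spark.locality.wait": 3000,           # base wait used at each level
    "spark.locality.wait.process": 10000,  # hold out longer for PROCESS_LOCAL
    "spark.locality.wait.node": 3000,      # wait before giving up NODE_LOCAL
    "spark.locality.wait.rack": 3000,      # wait before giving up RACK_LOCAL
}


def to_submit_flags(conf):
    """Render settings as `--conf key=value` arguments for spark-submit."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags


print(" ".join(to_submit_flags(locality_conf)))
```

Raising only spark.locality.wait.process is one way to favor cached data
without also delaying the later fallbacks.
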