Another observation I had was reading over local filesystem with “file://“. it 
was stated as PROCESS_LOCAL which was confusing. 

Regards,
Liming

On 13 Sep, 2014, at 3:12 am, Nicholas Chammas <nicholas.cham...@gmail.com> 
wrote:

> Andrew,
> 
> This email was pretty helpful. I feel like this stuff should be summarized in 
> the docs somewhere, or perhaps in a blog post.
> 
> Do you know if it is?
> 
> Nick
> 
> 
> On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash <and...@andrewash.com> wrote:
> The locality is how close the data is to the code that's processing it.  
> PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
> it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
> node, or in another executor on the same node, so is a little slower because 
> the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
> -- data is on a different server so needs to be sent over the network.
> 
> Spark switches to lower locality levels when there's no unprocessed data on a 
> node that has idle CPUs.  In that situation you have two options: wait until 
> the busy CPUs free up so you can start another task that uses data on that 
> server, or start a new task on a farther away server that needs to bring data 
> from that remote place.  What Spark typically does is wait a bit in the hopes 
> that a busy CPU frees up.  Once that timeout expires, it starts moving the 
> data from far away to the free CPU.
> 
> The main tunable option is how far long the scheduler waits before starting 
> to move data rather than code.  Those are the spark.locality.* settings here: 
> http://spark.apache.org/docs/latest/configuration.html
> 
> If you want to prevent this from happening entirely, you can set the values 
> to ridiculously high numbers.  The documentation also mentions that "0" has 
> special meaning, so you can try that as well.
> 
> Good luck!
> Andrew
> 
> 
> On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung <coded...@cs.stanford.edu> 
> wrote:
> I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume 
> that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
> 
> When these happen things get extremely slow.
> 
> Does this mean that the executor got terminated and restarted?
> 
> Is there a way to prevent this from happening (barring the machine actually 
> going down, I'd rather stick with the same process)?
> 
> 

Reply via email to