Another observation I had was reading over local filesystem with “file://“. it was stated as PROCESS_LOCAL which was confusing.
Regards, Liming On 13 Sep, 2014, at 3:12 am, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > Andrew, > > This email was pretty helpful. I feel like this stuff should be summarized in > the docs somewhere, or perhaps in a blog post. > > Do you know if it is? > > Nick > > > On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash <and...@andrewash.com> wrote: > The locality is how close the data is to the code that's processing it. > PROCESS_LOCAL means data is in the same JVM as the code that's running, so > it's really fast. NODE_LOCAL might mean that the data is in HDFS on the same > node, or in another executor on the same node, so is a little slower because > the data has to travel across an IPC connection. RACK_LOCAL is even slower > -- data is on a different server so needs to be sent over the network. > > Spark switches to lower locality levels when there's no unprocessed data on a > node that has idle CPUs. In that situation you have two options: wait until > the busy CPUs free up so you can start another task that uses data on that > server, or start a new task on a farther away server that needs to bring data > from that remote place. What Spark typically does is wait a bit in the hopes > that a busy CPU frees up. Once that timeout expires, it starts moving the > data from far away to the free CPU. > > The main tunable option is how far long the scheduler waits before starting > to move data rather than code. Those are the spark.locality.* settings here: > http://spark.apache.org/docs/latest/configuration.html > > If you want to prevent this from happening entirely, you can set the values > to ridiculously high numbers. The documentation also mentions that "0" has > special meaning, so you can try that as well. > > Good luck! > Andrew > > > On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung <coded...@cs.stanford.edu> > wrote: > I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume > that this means fully cached) to NODE_LOCAL or even RACK_LOCAL. > > When these happen things get extremely slow. > > Does this mean that the executor got terminated and restarted? > > Is there a way to prevent this from happening (barring the machine actually > going down, I'd rather stick with the same process)? > >