[ https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Ash updated SPARK-3526:
------------------------------
    Summary: Docs section on data locality  (was: Section on data locality)

> Docs section on data locality
> -----------------------------
>
>                 Key: SPARK-3526
>                 URL: https://issues.apache.org/jira/browse/SPARK-3526
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.0.2
>            Reporter: Andrew Ash
>
> Several threads on the mailing list have been about data locality and how to
> interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc. Let's get some more
> details in the docs on this concept so we can point future questions there.
> A couple of people appreciated the description of locality below, so it could
> be a good starting point:
> {quote}
> The locality is how close the data is to the code that's processing it.
> PROCESS_LOCAL means the data is in the same JVM as the code that's running,
> so it's really fast. NODE_LOCAL might mean that the data is in HDFS on the
> same node, or in another executor on the same node, so it is a little slower
> because the data has to travel across an IPC connection. RACK_LOCAL is even
> slower: the data is on a different server on the same rack, so it needs to be
> sent over the network.
> Spark switches to lower locality levels when there's no unprocessed data on a
> node that has idle CPUs. In that situation you have two options: wait until
> the busy CPUs free up so you can start another task that uses data on that
> server, or start a new task on a farther-away server that needs to bring the
> data from that remote place. What Spark typically does is wait a bit in the
> hope that a busy CPU frees up. Once that timeout expires, it starts moving
> the data from far away to the free CPU.
> {quote}
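
The "wait a bit, then fall back" behaviour described in the quote can be sketched as a toy model. This is not Spark's scheduler code: the `Place` type and the functions `classify`, `worst_allowed`, and `should_launch` are hypothetical names made up for illustration, and the model assumes one fixed wait per locality level. The 3-second value mirrors the documented default of `spark.locality.wait`; `ANY` is the level Spark uses for data outside the rack.

```python
from dataclasses import dataclass
from enum import IntEnum


class Locality(IntEnum):
    # Smaller value = data is closer to the code that processes it.
    PROCESS_LOCAL = 0
    NODE_LOCAL = 1
    RACK_LOCAL = 2
    ANY = 3


@dataclass(frozen=True)
class Place:
    # Hypothetical description of where a data block or a free CPU slot lives.
    executor: str
    host: str
    rack: str


def classify(data: Place, slot: Place) -> Locality:
    """Locality a task would get if it ran in `slot` and read `data`."""
    if data.executor == slot.executor:
        return Locality.PROCESS_LOCAL
    if data.host == slot.host:
        return Locality.NODE_LOCAL
    if data.rack == slot.rack:
        return Locality.RACK_LOCAL
    return Locality.ANY


def worst_allowed(waited_s: float, wait_per_level_s: float = 3.0) -> Locality:
    """Loosest level the toy scheduler accepts after waiting `waited_s` seconds."""
    if wait_per_level_s <= 0:
        # No waiting configured: any free slot is acceptable immediately.
        return Locality.ANY
    # Each full wait period that passes relaxes the requirement by one level.
    steps = max(0, int(waited_s // wait_per_level_s))
    return Locality(min(steps, int(Locality.ANY)))


def should_launch(data: Place, slot: Place, waited_s: float,
                  wait_per_level_s: float = 3.0) -> bool:
    """True if running on `slot` is close enough given how long we have waited."""
    return classify(data, slot) <= worst_allowed(waited_s, wait_per_level_s)
```

For example, with data on `Place("exec-1", "host-a", "rack-1")` and a free slot on `Place("exec-2", "host-a", "rack-1")`, `should_launch` is false after 1 second of waiting and true after 3 seconds, because the slot is only NODE_LOCAL. The real scheduler tracks this per task set and is more involved, so treat the sketch only as an aid for reading the locality column in the UI.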