> On 26 Oct 2015, at 09:28, Jinfeng Li <liji...@gmail.com> wrote:
>
> Replication factor is 3 and we have 18 data nodes. We check HDFS webUI,
> data is evenly distributed among 18 machines.
Every block in HDFS (usually 64, 128, or 256 MB) is replicated across three machines, so with replication factor 3 and 18 data nodes, any given block is local to 3 machines and remote to the other 15. For data locality to work properly, each executor needs to be reading the blocks that are local to it, not data from other parts of the files. Spark does try to schedule for locality, but if there's only a limited set of executors, then more of the workload ends up remote rather than local.

I don't know of an obvious way to get metrics here on local vs. remote reads. I don't see the HDFS client library tracking that, though it should be the place to collect stats on local, remote, and domain-socket (short-circuit) IO. Does anyone know of anything in the Spark metrics which tracks placement locality? If not, both layers could have some more metrics added.
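In the meantime, one way to approximate this from the Spark side (a sketch of my own, not an existing metric) would be a SparkListener that tallies the locality level the scheduler actually achieved for each completed task, since TaskInfo carries that:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
  import scala.collection.mutable

  // Count finished tasks by locality level
  // (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY, ...).
  class LocalityCountListener extends SparkListener {
    private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
      counts(taskEnd.taskInfo.taskLocality.toString) += 1
    }

    def report: String = synchronized {
      counts.map { case (k, v) => s"$k=$v" }.mkString(", ")
    }
  }

  // usage:
  //   val listener = new LocalityCountListener
  //   sc.addSparkListener(listener)
  //   ... run the job ...
  //   println(listener.report)

That only tells you where the scheduler *placed* the task, not how many bytes actually came off the local disk vs. over the network, but it's a start.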
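On the HDFS side, if I'm reading the 2.x client code right, DFSInputStream does keep per-stream read statistics (HDFS-3672); nothing surfaces them through Spark, but you can reach them by casting the stream. A sketch, assuming the file really is on HDFS so the cast succeeds; /some/file is just a placeholder:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.hadoop.hdfs.client.HdfsDataInputStream

  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  // on HDFS, open() returns an HdfsDataInputStream under the covers
  val in = fs.open(new Path("/some/file")).asInstanceOf[HdfsDataInputStream]
  try {
    val buf = new Array[Byte](1 << 20)
    while (in.read(buf) != -1) {}   // drain the stream
    val stats = in.getReadStatistics
    println(s"total=${stats.getTotalBytesRead} " +
      s"local=${stats.getTotalLocalBytesRead} " +
      s"short-circuit=${stats.getTotalShortCircuitBytesRead}")
  } finally {
    in.close()
  }

Aggregating those counters per executor and publishing them into the Spark metrics system would be the missing piece.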