This feature is called "block affinity groups", and it has been under discussion for a while, but it isn't fully implemented yet. HDFS-2576 is not a complete solution: it changes only the initial placement of blocks, not the way the balancer works. Once heterogeneous storage management (HDFS-2832) is implemented, you will be able to get a similar effect by using separate storages, at the cost of fragmenting the backing store somewhat.
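For concreteness, here is a minimal sketch (not a definitive recipe) of the client-side favored-nodes hint that came out of HDFS-2576, assuming the Hadoop 2.x DistributedFileSystem#create overload that takes an InetSocketAddress[] of preferred datanodes. The hostnames, path, and block size are placeholders:

import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodesDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("expected HDFS, got " + fs.getUri());
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Placeholder datanodes; to co-locate related files, pass the same
    // set for each of them. 50010 is the default data-transfer port.
    InetSocketAddress[] favored = {
      new InetSocketAddress("dn1.example.com", 50010),
      new InetSocketAddress("dn2.example.com", 50010),
      new InetSocketAddress("dn3.example.com", 50010),
    };

    // The favored-nodes create overload. Note this is only a hint: the
    // namenode can place blocks elsewhere if a favored node is full or
    // down, and the balancer may still move them later.
    FSDataOutputStream out = dfs.create(
        new Path("/warehouse/users/part-00000"),
        FsPermission.getFileDefault(),
        true,                                      // overwrite
        conf.getInt("io.file.buffer.size", 4096),  // buffer size
        (short) 3,                                 // replication
        128L * 1024 * 1024,                        // block size
        null,                                      // no progress callback
        favored);
    try {
      out.writeBytes("example payload\n");
    } finally {
      out.close();
    }
  }
}

Writing several related files with the same favored array gets you initial co-location at best; since the balancer knows nothing about the hint, that co-location can decay over time, which is exactly the gap described above.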
Of course, "co-locating related data blocks" is often bad, not good: it reduces the amount of parallelism a single job can exploit, and it can increase the chance of losing an entire dataset to node failures. That's one reason the current semi-random placement strategy has lasted so long. In other words, this is workload-dependent.

best,
Colin

On Tue, Aug 26, 2014 at 5:20 AM, Gary Malouf <malouf.g...@gmail.com> wrote:
> It appears support for this type of control over block placement is going
> out in the next version of HDFS:
> https://issues.apache.org/jira/browse/HDFS-2576
>
>
> On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf <malouf.g...@gmail.com> wrote:
>
>> One of my colleagues has been asking me why Spark/HDFS makes no
>> attempt to co-locate related data blocks. He pointed to this
>> paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
>> CoHadoop research and the performance improvements it yielded for
>> Map/Reduce jobs.
>>
>> Would leveraging these ideas when writing data from Spark make sense/be
>> worthwhile?