This feature is called "block affinity groups" and it's been under
discussion for a while, but isn't fully implemented yet.  HDFS-2576 is
not a complete solution because it doesn't change the way the balancer
works, just the initial placement of blocks.  Once heterogeneous
storage management (HDFS-2832) is implemented, you will be able to get
a similar effect by using separate storages, at the cost of
fragmenting the backing store somewhat.

Of course, "co-locating related data blocks" is often bad, not good,
because it reduces the amount of parallelism a single job can exploit,
and can increase the chance of losing an entire dataset due to node
failures.  That's one reason why the current semi-random placement
strategy has lasted so long.  In other words, this is
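
To make the parallelism and failure points above concrete, here is a rough
toy model in Python.  It is only a sketch under simple assumptions: the
function names, the 50-node cluster and the 200-block dataset are made up
for illustration, and none of this is HDFS code or an HDFS API.  It
compares uniformly random replica placement with packing every block of a
dataset onto the same three nodes.

```python
import random


def random_placement(num_blocks, num_nodes, replication, rng):
    # Each block gets `replication` distinct nodes chosen uniformly at random.
    return [rng.sample(range(num_nodes), replication) for _ in range(num_blocks)]


def colocated_placement(num_blocks, num_nodes, replication, rng):
    # Every block of the dataset lands on the same `replication` nodes.
    group = rng.sample(range(num_nodes), replication)
    return [list(group) for _ in range(num_blocks)]


def nodes_holding_data(placement):
    # Nodes with at least one replica; an upper bound on how many nodes
    # can run data-local tasks for this dataset.
    return {node for replicas in placement for node in replicas}


def dataset_lost(placement, failed_nodes):
    # The entire dataset is gone only if every block lost all its replicas.
    return all(set(replicas) <= failed_nodes for replicas in placement)


rng = random.Random(0)  # fixed seed so the run is repeatable
spread = random_placement(200, 50, 3, rng)
packed = colocated_placement(200, 50, 3, rng)

print("nodes usable for local reads (random):    ", len(nodes_holding_data(spread)))
print("nodes usable for local reads (co-located):", len(nodes_holding_data(packed)))

# Fail exactly the three nodes that hold the first block of each layout.
print("random layout fully lost:    ", dataset_lost(spread, set(spread[0])))
print("co-located layout fully lost:", dataset_lost(packed, set(packed[0])))
```

In this model the co-located dataset can only be read locally from three
nodes, and losing those three nodes loses every block at once, whereas the
random layout spreads both the work and the risk across the cluster.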


On Tue, Aug 26, 2014 at 5:20 AM, Gary Malouf <> wrote:
> It appears support for this type of control over block placement is going
> out in the next version of HDFS:
> On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf <> wrote:
>> One of my colleagues has been questioning me as to why Spark/HDFS makes no
>> attempts to try to co-locate related data blocks.  He pointed to this
>> paper: from 2011 on the
>> CoHadoop research and the performance improvements it yielded for
>> Map/Reduce jobs.
>> Would leveraging these ideas for writing data from Spark make sense/be
>> worthwhile?
