This feature is called "block affinity groups" and it's been under
discussion for a while, but isn't fully implemented yet.  HDFS-2576 is
not a complete solution because it doesn't change the way the balancer
works, just the initial placement of blocks.  Once heterogeneous
storage management (HDFS-2832) is implemented, you will be able to get
a similar effect by using separate storages, at the cost of
fragmenting the backing store somewhat.
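
For reference, the control HDFS-2576 does give you is client-side: a
create() overload on DistributedFileSystem that takes an array of
"favored" DataNodes.  A minimal sketch of using it (the hostnames,
port, and path are made up, and note again this only biases the
initial placement -- the hint is best-effort and the balancer can
still move blocks later):

import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodesExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at an HDFS cluster, so the cast is safe.
    DistributedFileSystem dfs =
        (DistributedFileSystem) new Path("hdfs:///").getFileSystem(conf);

    // Ask the NameNode to prefer these DataNodes for every block of
    // the file.  This is a hint, not a guarantee.
    InetSocketAddress[] favoredNodes = {
        new InetSocketAddress("datanode1.example.com", 50010),
        new InetSocketAddress("datanode2.example.com", 50010),
    };

    FSDataOutputStream out = dfs.create(
        new Path("/data/part-00000"),
        FsPermission.getFileDefault(),
        true,                                      // overwrite
        conf.getInt("io.file.buffer.size", 4096),  // buffer size
        (short) 3,                                 // replication
        128L * 1024 * 1024,                        // block size
        null,                                      // no Progressable
        favoredNodes);
    out.writeBytes("co-located with its siblings, at least initially\n");
    out.close();
  }
}

Two files written with the same favored-nodes array will tend to land
on the same DataNodes, which is the CoHadoop-style effect -- until the
balancer or a re-replication after node failure separates them.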

Of course, "co-locating related data blocks" is often bad, not good,
because it reduces the amount of parallelism a single job can exploit,
and can increase the chance of losing an entire dataset due to node
failures.  That's one reason why the current semi-random placement
strategy has lasted so long.  In other words, this is
workload-dependent.

best,
Colin

On Tue, Aug 26, 2014 at 5:20 AM, Gary Malouf <malouf.g...@gmail.com> wrote:
> It appears support for this type of control over block placement is going
> out in the next version of HDFS:
> https://issues.apache.org/jira/browse/HDFS-2576
>
>
> On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf <malouf.g...@gmail.com> wrote:
>
>> One of my colleagues has been asking me why Spark/HDFS makes no
>> attempt to co-locate related data blocks.  He pointed to this
>> paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
>> CoHadoop research and the performance improvements it yielded for
>> Map/Reduce jobs.
>>
>> Would leveraging these ideas for writing data from Spark make sense/be
>> worthwhile?
>>
