[
https://issues.apache.org/jira/browse/CRUNCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabriel Reid updated CRUNCH-644:
--------------------------------
Attachment: CRUNCH-644.patch
Patch which sets the preferred node at the time of HFile creation. I've tested
this patch on a multi-node cluster and verified that data locality is 100%
after a bulk load (before the patch, the same bulk load resulted in data
locality of about 30%).
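
For a rough sense of why locality tends to be low without a placement hint, the
sketch below estimates the chance that one particular data node holds a replica
of a block. It assumes the replicas are spread over distinct nodes chosen
uniformly at random, which is a simplification: real HDFS placement is
rack-aware and favors the writer's own node when the writer is a data node. The
class and method names are made up for illustration and are not part of the
patch.

```java
public class LocalityEstimate {

    /**
     * Chance that one specific data node holds a replica of a block, ASSUMING
     * the replicas go to distinct nodes picked uniformly at random. Under that
     * assumption the chance is replication / dataNodes, capped at 1.
     */
    public static double chanceOfLocalReplica(int dataNodes, int replication) {
        if (dataNodes <= 0 || replication <= 0) {
            throw new IllegalArgumentException("dataNodes and replication must be positive");
        }
        return Math.min(1.0, (double) replication / dataNodes);
    }

    public static void main(String[] args) {
        // Illustrative cluster sizes only; not measurements from any real cluster.
        int replication = 3;
        for (int dataNodes : new int[] {3, 20, 50, 200}) {
            System.out.printf("%d data nodes, replication %d: %.1f%%%n",
                    dataNodes, replication,
                    100 * chanceOfLocalReplica(dataNodes, replication));
        }
    }
}
```

Under this model the chance of a local replica shrinks as the cluster grows,
which matches the dependence on cluster size and replication factor noted in
the issue description.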
> Set HDFS node affinity on created HFiles to improve locality
> ------------------------------------------------------------
>
> Key: CRUNCH-644
> URL: https://issues.apache.org/jira/browse/CRUNCH-644
> Project: Crunch
> Issue Type: Improvement
> Reporter: Gabriel Reid
> Attachments: CRUNCH-644.patch
>
>
> When creating HFiles via the {{HFileUtils.writeToHFilesForIncrementalLoad}}
> method, the underlying HDFS blocks of the created HFiles will end up on a
> selection of HDFS data nodes -- the selection of which nodes is left up to
> the HDFS Namenode. This means that there is a relatively small chance
> (depending on cluster size and replication factor) that the created HFiles
> will end up on the same physical machine as the region server which will make
> use of these HFiles, which limits the ability to use short-circuit reads to
> the local file system. Typically, this lack of locality is only fully
> resolved after a major compaction.
> It's possible to set a node affinity on HDFS files at creation time, to
> provide a suggestion to the Namenode about a preferred data node for blocks
> to be located on. The intention of this ticket is to make use of this
> functionality to set the node affinity during HFile creation in
> {{HFileUtils.writeToHFilesForIncrementalLoad}} so that at least one (HDFS)
> block of each created HFile will be located on the same physical machine as
> the region server which will be using the file (assuming HDFS data nodes are
> running on the same machines as HBase region servers).
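
The approach described above could be sketched roughly as follows. This is not
the code from the attached patch; it only illustrates the lookup step: given a
row key, find the region it falls into, and turn that region's server host
into a favored-node address. The class name, the use of String keys (HBase
keys are really byte arrays), the host names, and the port are all assumptions
made for the example. As far as I know, the resulting array is the kind of
value accepted by the `favoredNodes` parameter of
`DistributedFileSystem.create(...)` in Hadoop 2.x and by
`HFile.WriterFactory.withFavoredNodes(...)` in HBase, but check that against
the versions in use.

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.TreeMap;

public class FavoredNodeLookup {

    // Region start key -> host name of the region server serving that region.
    // Sorted so that floorEntry() finds the region containing a given row key.
    private final TreeMap<String, String> regionStartKeyToHost = new TreeMap<>();

    // Data transfer port of the data nodes (an assumption of this sketch).
    private final int dataNodePort;

    public FavoredNodeLookup(Map<String, String> regionStartKeyToHost, int dataNodePort) {
        this.regionStartKeyToHost.putAll(regionStartKeyToHost);
        this.dataNodePort = dataNodePort;
    }

    /**
     * Returns the favored node for the region containing rowKey, or an empty
     * array if no region start key is less than or equal to rowKey.
     */
    public InetSocketAddress[] favoredNodesFor(String rowKey) {
        Map.Entry<String, String> region = regionStartKeyToHost.floorEntry(rowKey);
        if (region == null) {
            return new InetSocketAddress[0];
        }
        // Unresolved address: no DNS lookup is performed here.
        return new InetSocketAddress[] {
            InetSocketAddress.createUnresolved(region.getValue(), dataNodePort)
        };
    }

    public static void main(String[] args) {
        // Hypothetical two-region table; the first region has an empty start key.
        Map<String, String> regions = new TreeMap<>();
        regions.put("", "rs1.example.com");
        regions.put("m", "rs2.example.com");

        // 50010 is only an example; use the port from the cluster's
        // dfs.datanode.address setting.
        FavoredNodeLookup lookup = new FavoredNodeLookup(regions, 50010);
        for (InetSocketAddress node : lookup.favoredNodesFor("row-42")) {
            System.out.println(node.getHostString() + ":" + node.getPort());
        }
    }
}
```

The favored node is only a suggestion to the Namenode, so a writer using it
should still expect that some blocks may end up elsewhere.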
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)