HBase would benefit from similar functionality. In our case, we'd like all
replicas for all files in a given HDFS path to land on the same set of
machines. That way, in the event of a failover, regions can be assigned to
one of these other machines that has local access to all blocks for all
region files.
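
As a best-effort approximation, I believe the closest existing hook is the
favored-nodes hint on DistributedFileSystem#create (the work done with HBase
in mind, HDFS-2576 if I remember the JIRA right). It is only a hint (the
balancer or re-replication can still move blocks later), but a rough,
untested sketch with made-up hostnames and path would look like:

import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodesSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The favored-nodes overload lives on DistributedFileSystem.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    // Hypothetical DataNodes we want every replica of the region's files on.
    InetSocketAddress[] favored = new InetSocketAddress[] {
        new InetSocketAddress("dn1.example.com", 50010),
        new InetSocketAddress("dn2.example.com", 50010),
        new InetSocketAddress("dn3.example.com", 50010)
    };

    // Best-effort: the NameNode tries to place all replicas on these nodes.
    FSDataOutputStream out = dfs.create(
        new Path("/hbase/example/region/cf/hfile"),  // made-up path
        FsPermission.getFileDefault(),
        true,                                      // overwrite
        conf.getInt("io.file.buffer.size", 4096),  // buffer size
        (short) 3,                                 // replication
        dfs.getDefaultBlockSize(),                 // block size
        null,                                      // no progress callback
        favored);
    out.close();
  }
}

What we are really after, though, is the same guarantee applied transparently
to every file under a region's directory, replicas included, without the
writer having to pick nodes itself.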

On Thu, Dec 18, 2014 at 3:36 PM, Zhe Zhang <zhe.zhang.resea...@gmail.com>
wrote:
>
> > The second aspect is that our queries are time-based, and this time window
> > follows a familiar pattern of old data not being queried much. Hence we
> > would like to preserve the most recent data in the HDFS cache (Impala is
> > helping us manage this aspect via its command set), but we would like the
> > next-most-recent data chunks to land on an SSD that is present on every
> > datanode. The remaining set of blocks, which are "very old but in large
> > quantities", would land on spinning disks. The decision to choose a given
> > volume is based on the file name, as we can control the filename that is
> > used to generate the file.
> >
>
> Have you tried the 'setStoragePolicy' command? It's part of the HDFS
> "Heterogeneous Storage Tiers" work and seems to address your scenario.
>
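
For reference, the same policies can also be set from the Java API on
DistributedFileSystem (2.6.0 or later, if I have the version right). A rough,
untested sketch, where the paths are examples and the exact set of policy
names depends on the release:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicySketch {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    // Recent data: one replica on SSD, the rest on spinning disk.
    dfs.setStoragePolicy(new Path("/data/2014/12"), "ONE_SSD");

    // Old-but-large data: all replicas on ARCHIVE storage.
    dfs.setStoragePolicy(new Path("/data/2013"), "COLD");
  }
}

My understanding is that the policy mainly affects new writes; blocks that
are already written have to be migrated separately (the mover tool, if I
recall correctly).
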
> > 1. Is there a way to ensure that all file blocks belonging to a particular
> > HDFS directory & file go to the same physical datanode (and their
> > corresponding replicas as well?)
>
> This seems inherently hard: the file/dir could have more data than a
> single DataNode can host. Implementation-wise, it requires some sort
> of map in BlockPlacementPolicy from inode or file path to DataNode
> address.
>
> My 2 cents..
>
> --
> Zhe Zhang
> Software Engineer, Cloudera
> https://sites.google.com/site/zhezhangresearch/
>
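
Agreed that a general mapping is hard; in the HBase case the set of files
under a region directory is bounded, so a pluggable policy keyed on that
directory might be enough. If we ever prototype it, my understanding is that
the hook is the NameNode's dfs.block.replicator.classname setting, pointing
at a subclass of BlockPlacementPolicyDefault whose chooseTarget() consults a
path-to-preferred-DataNodes map. A sketch of the wiring only (the policy
class name is made up):

import org.apache.hadoop.conf.Configuration;

/**
 * Sketch only: shows where a custom placement policy would be plugged in.
 * The property normally lives in the NameNode's hdfs-site.xml; the class
 * named below does not exist and would extend BlockPlacementPolicyDefault,
 * overriding chooseTarget() to map a file's path to preferred DataNodes.
 */
public class PlacementPolicyWiring {
  static Configuration withPathAffinityPolicy() {
    Configuration conf = new Configuration();
    conf.set("dfs.block.replicator.classname",
        "com.example.PathAffinityPolicy");
    return conf;
  }
}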
