Yan created HDFS-6121:
-------------------------

             Summary: Support of "mount" onto HDFS directories
                 Key: HDFS-6121
                 URL: https://issues.apache.org/jira/browse/HDFS-6121
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: datanode
            Reporter: Yan


Currently, HDFS configuration can only create HDFS on one or several existing 
local file system directories. This pretty much abstracts physical disk drives 
away from HDFS users.

While it may provide conveniences in data 
movement/manipulation/management/formatting, it could deprive users of a way 
to access physical disks in a more directly controlled manner.

For instance, a multi-threaded server may wish to access its disk blocks 
sequentially per thread for fear of random I/O otherwise. If the cluster boxes 
have multiple physical disk drives, and the server load is pretty much 
I/O-bound, then it is quite reasonable to hope for disk performance typical of 
sequential I/O. Disk read-ahead and/or buffering at various layers may 
alleviate the problem to some degree, but they cannot totally eliminate it. 
This could hurt the performance of workloads that need to scan data.

Map/Reduce may experience the same problem as well.

For instance, HBase region servers may wish to scan disk data for each region 
in a sequential way, again, to avoid random I/O. HBase's current limitations in 
this regard aside, one major obstacle is HDFS's inability to specify mappings 
of local directories to HDFS directories. Specifically, the "dfs.data.dir" 
configuration setting only allows for a mapping from one or multiple local 
directories to the HDFS root directory. On data nodes with multiple disk drives 
mounted as multiple local file system directories, the HDFS data will be spread 
across all disk drives in a fairly random manner, potentially resulting in 
random I/O when a multi-threaded server reads multiple data blocks from each 
thread.
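To illustrate the current behavior: today a datanode's storage directories are 
all listed under a single property in hdfs-site.xml (paths below are 
illustrative), and there is no way to associate any one of them with a 
particular HDFS directory:

```xml
<!-- hdfs-site.xml: every listed directory backs the whole HDFS namespace. -->
<!-- Block replicas are spread across all of them, regardless of which     -->
<!-- HDFS directory a file lives in.                                       -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data,/mnt/disk3/dfs/data</value>
</property>
```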

A seemingly simple enhancement is the introduction of mappings from one or 
multiple local FS directories to a single HDFS directory, plus the necessary 
sanity checks, replication policies, best-practice advice, etc., of course. 
Note that this should be a one-to-one or many-to-one mapping from local to 
HDFS directories. The other way around, though probably feasible, would not 
serve our purpose at all. This is similar to mounting different disks onto 
different local FS directories, and would give users an option to place and 
access their data in a more controlled and efficient way. 
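One possible shape for such a mapping, purely as a sketch (the property name 
and value syntax below are hypothetical, not part of any existing HDFS 
release), would pair each local directory with the HDFS directory it backs:

```xml
<!-- HYPOTHETICAL syntax for this proposal, not an existing HDFS property. -->
<!-- Each entry pins an HDFS directory to one local FS directory, so e.g.  -->
<!-- files under /hbase/region1 would land only on disk1, giving a         -->
<!-- per-thread reader a chance at sequential I/O on that drive.           -->
<property>
  <name>dfs.data.dir.mount</name>
  <value>/mnt/disk1/dfs/data:/hbase/region1,/mnt/disk2/dfs/data:/hbase/region2</value>
</property>
```

Many-to-one mappings (several local directories backing one HDFS directory) 
would fit the same syntax by repeating the HDFS path with different local 
paths.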

Conceptually this option will allow for local physical data partition in a 
distributed environment for application data on HDFS.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
