Hello,

My company will be building a small but quickly growing Hadoop deployment, and 
I had a question regarding best practice for configuring the storage for the 
datanodes.  Cloudera has a page where they recommend a JBOD configuration over 
RAID.  My question, though, is whether they are referring to the simplest 
definition of JBOD, that being literally just a collection of heterogeneous 
drives, each with its own distinct partition and mount point?  Or are they 
referring to a concatenated span of heterogeneous drives presented to the OS as 
a single device?

Through some digging I've discovered that data volumes may be specified in a 
comma-delimited fashion in the hdfs-site.xml file and are then accessed 
individually, but are of course all available within the pool.  To test this I 
brought an Ubuntu Server 10.04 VM online (on Xen Cloud Platform) with 3 storage 
volumes.  The first holds the OS; I created a single partition on each of the 
second and third, mounted them as /hadoop-datastore/a and /hadoop-datastore/b 
respectively, and specified them in hdfs-site.xml in comma-delimited fashion.  I 
then went on to construct a single-node pseudo-distributed install, executed 
the bin/start-all.sh script, and all seems just great.  The volumes are 5GB 
each, and the HDFS status page shows a configured capacity of 9.84GB, so both 
are in use.  I also successfully added a file using bin/hadoop dfs -put.
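
A minimal sketch of the relevant hdfs-site.xml entry for this test.  It assumes 
the dfs.data.dir property name used by 0.20-era releases (later releases call 
it dfs.datanode.data.dir); only the two mount points come from the setup 
described above.

```xml
<!-- hdfs-site.xml: sketch only; the property name is an assumption based on
     0.20-era Hadoop, so check it against the release actually installed -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <!-- One directory per physical drive, comma-delimited, no spaces -->
    <value>/hadoop-datastore/a,/hadoop-datastore/b</value>
  </property>
</configuration>
```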

This led me to think that perhaps an optimal datanode configuration would be 2 
drives in RAID 1 for the OS, then 2-4 additional drives for data, individually 
partitioned, mounted, and configured in hdfs-site.xml.  Mirrored system drives 
would make my node more robust but data drives would still be independent.  I 
do realize that HDFS assures data redundancy at a higher level by design, but 
having the loss of a single drive necessitate rebuilding an entire node, and 
therefore being down in capacity during that period, just doesn't seem to be 
the most efficient approach.

Would love to hear what others are doing in this regard, whether anyone is 
using concatenated disks and whether the loss of a drive requires them to 
rebuild the entire system.

John Buchanan
