On Mon, Feb 7, 2011 at 2:06 PM, Jonathan Disher <jdis...@parad.net> wrote:
> I've never seen an implementation of concat volumes that tolerates a disk
> failure, just like RAID0.
>
> Currently I have a 48 node cluster using Dell R710's with 12 disks - two
> 250GB SATA drives in RAID1 for the OS, and ten 1TB SATA disks as a JBOD
> (mounted on /data/0 through /data/9) and listed separately in
> hdfs-site.xml. It works... mostly. The big issue you will encounter is
> losing a disk - the DataNode process will crash, and if you comment out
> the affected drive, when you replace it you will have 9 disks full to N%
> and one empty disk. The DFS balancer cannot fix this - usually when I have
> data nodes down more than an hour, I format all drives in the box and
> rebalance.

You can manually move blocks around between volumes in a single node (while
the DN is not running) - there's a rough sketch at the bottom of this mail.
It would be great someday if the DN managed this automatically.

> We are building a new cluster aimed primarily at storage - we will be
> using SuperMicro 4U machines with 36 2TB SATA disks in three RAID6 volumes
> (for roughly 20TB usable per volume, 60TB total), plus two SSDs for the
> OS. At this scale, JBOD is going to kill you (rebalancing 40-50TB, even
> when I bond 8 gigE interfaces together, takes too long).

Interesting. It's not JBOD that kills you there, it's the fact that you have
72TB of data on a single box. You mitigate the failure risks by going RAID6,
but you're still going to have block movement issues by going with
high-storage DataNodes if/when they fail for any reason.

> -j
>
> On Feb 7, 2011, at 12:25 PM, John Buchanan wrote:
>
> Hello,
>
> My company will be building a small but quickly growing Hadoop deployment,
> and I had a question regarding best practice for configuring the storage
> for the datanodes. Cloudera has a page where they recommend a JBOD
> configuration over RAID. My question, though, is whether they are
> referring to the simplest definition of JBOD, that being literally just a
> collection of heterogeneous drives, each with its own distinct partition
> and mount point? Or are they referring to a concatenated span of
> heterogeneous drives presented to the OS as a single device?
>
> Through some digging I've discovered that data volumes may be specified in
> a comma-delimited fashion in the hdfs-site.xml file and are then accessed
> individually, but are of course all available within the pool. To test
> this I brought an Ubuntu Server 10.04 VM online (on Xen Cloud Platform)
> with 3 storage volumes. The first is for the OS; on the second and third I
> created a single partition each, mounting them as /hadoop-datastore/a and
> /hadoop-datastore/b respectively and specifying them in hdfs-site.xml in
> comma-delimited fashion. I then continued to construct a single node
> pseudo-distributed install, executed the bin/start-all.sh script, and all
> seems just great. The volumes are 5GB each, and the HDFS status page shows
> a configured capacity of 9.84GB, so both are in use; I successfully added
> a file using bin/hadoop dfs -put.
>
> This led me to think that perhaps an optimal datanode configuration would
> be 2 drives in RAID1 for the OS, then 2-4 additional drives for data,
> individually partitioned, mounted, and configured in hdfs-site.xml.
> Mirrored system drives would make my node more robust but the data drives
> would still be independent.
> I do realize that HDFS assures data redundancy at a higher level by
> design, but having the loss of a single drive necessitate rebuilding an
> entire node, and therefore being down in capacity during that period, just
> doesn't seem to be the most efficient approach.
>
> Would love to hear what others are doing in this regard, whether anyone is
> using concatenated disks and whether the loss of a drive requires them to
> rebuild the entire system.
>
> John Buchanan
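
(Re: moving blocks by hand, here's the rough sketch I mentioned above. It
assumes a 0.20-style on-disk layout where each dfs.data.dir volume keeps its
block files under a current/ directory, and dfs.data.dir entries like
/data/N/dfs/data - check the actual paths and block names on your own nodes
before trying this; the block ID and generation stamp below are made up.)

  # stop the DataNode on this host first
  bin/hadoop-daemon.sh stop datanode

  # move a block file together with its .meta file from the full volume to
  # the emptier one - both files must land on the same volume; the DN
  # rebuilds its volume map from disk when it starts back up
  mv /data/3/dfs/data/current/blk_1234567890123456789 \
     /data/3/dfs/data/current/blk_1234567890123456789_1001.meta \
     /data/9/dfs/data/current/

  # bring the DataNode back up
  bin/hadoop-daemon.sh start datanode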