Thank you Scott and Jonathan, this is exactly the sort of information I've been looking for. Coming from environments where data integrity is not ensured in the same distributed manner as with HDFS, concatenation or striping without parity would make me lose sleep at night.
What we were thinking for our first deployment was 10 HP DL385s, each with 8 2TB SATA drives. The first pair would be in RAID1 for the system drive; each of the remaining six would hold a distinct partition and mount point, listed in hdfs-site.xml in comma-delimited fashion. It seems to make more sense to use RAID at least for the system drives so that the loss of one drive won't take down the entire node. Granted, data integrity wouldn't be affected, but how much time do you want to spend rebuilding an entire node due to the loss of one drive? We considered using a smaller pair for the system drives, but if they're all the same then we only need to stock one type of spare drive.

Another question I have is whether using 1TB drives would be advisable over 2TB for the purpose of reducing rebuild time. Or perhaps I'm still thinking of this as I would a RAID volume. If we needed to rebalance across the cluster, would the time needed depend more on the amount of data involved and the connectivity between nodes?

-John

On 2/7/11 4:40 PM, "Scott Golby" <sgo...@conductor.com> wrote:

>
>> The big issues you will encounter is losing a disk - the DataNode
>> process will crash, and if you comment out the affected drive,
>> when you replace it you will have 9 disks full to N% and one empty
>> disk.
>> The DFS balancer cannot fix this - usually when I have data nodes down
>> more than an hour, I format all drives in the box and rebalance.
>
> Yeah this bites us when we add a disk, love getting monitors going off
> for "disk 90% full" when you've got the new disk at <10%. We've tried a
> few tricks moving the reserved blocks up to force 'balance' it but it's
> pretty ineffective by and large.
>
>>> but if the loss of a single drive necessitated rebuilding an entire
>>> node, and therefore being down in capacity during that period,
>>> just doesn't seem to be the most efficient approach
>
> This bit about rebuilding the entire node isn't true, that's just
> Jonathan's choice to wipe the node & an interesting one it is (we might
> consider that for our small cluster). Lose a disk & you lose just the
> capacity of that disk from the entire pool of space in the cluster.
>
> 1 out of 3 copies of *some* of the HDFS blocks go away, not the entire
> nodes blocks, usually this wouldn't be very much of a loss (typical 4
> disk boxes, x XYZ boxes = quite a few disks). The 1 missing replica will
> likely be re-copied (I often say re-built, but that's RAID) before you
> put the new disk in, but say somehow you were 100% full, you'd add the
> new disk and the blocks which were in a 2 copies/replica state would copy
> themselves a 3rd time. (the lack of inter-node disk balance is an issue
> again here)
>
>> We are building a new cluster aimed primarily at storage - we will be
>> using SuperMicro 4U machines
>> with 36 2TB SATA disks in three RAID6 volumes (for roughly 20TB usable
>> per volume, 60 total),
>
> I really like the SuperMicro cases for big disk boxes. What are you
> using to run the 36 disks all at once ?
>
> Scott Golby
>
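
On the 1TB versus 2TB question raised at the top of this message, a rough back-of-envelope sketch in Python may help frame it. It assumes the replicas from one failed disk are re-copied by the other nodes in parallel, with the work spread evenly and each node sustaining a fixed copy rate. The function name, the 70% fill level, the 9 participating nodes, and the 20 MB/s per-node rate are all hypothetical placeholders, not measured values or Hadoop defaults.

```python
def rereplication_hours(disk_tb, fill_fraction, nodes, per_node_mb_s):
    """Rough hours to re-copy the replicas that lived on one failed disk.

    Simplifying assumptions (not Hadoop behaviour guarantees):
    - the copy work is spread evenly across `nodes` machines
    - each machine sustains `per_node_mb_s` of re-replication traffic
    """
    lost_mb = disk_tb * 1_000_000 * fill_fraction  # decimal TB -> MB
    cluster_mb_s = nodes * per_node_mb_s           # aggregate copy rate
    return lost_mb / cluster_mb_s / 3600           # seconds -> hours


# Compare a failed 1TB disk with a failed 2TB disk, both 70% full (assumed).
for size_tb in (1, 2):
    hours = rereplication_hours(size_tb, fill_fraction=0.7,
                                nodes=9, per_node_mb_s=20)
    print(f"{size_tb} TB disk: ~{hours:.1f} h to restore full replication")
```

Under this model the time scales linearly with the amount of data on the failed disk and inversely with the aggregate throughput between nodes, which matches the intuition that data volume and connectivity matter more than drive size as such.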