Thank you Scott and Jonathan, this is exactly the sort of information I've
been looking for.  Coming from environments where data integrity isn't
ensured in the distributed manner HDFS provides, the thought of
concatenation or striping without parity would make me lose sleep at night.

What we were thinking for our first deployment was 10 HP DL385s, each with
eight 2TB SATA drives: the first pair in RAID1 for the system volume, and
each of the remaining six drives carrying its own partition and mount
point, listed comma-delimited in hdfs-site.xml.  It seems to make more
sense to use RAID at least for the system drives so the loss of one drive
won't take down the entire node.  Granted, data integrity wouldn't be
affected, but how much time do you want to spend rebuilding an entire node
over the loss of one drive?  We considered using a smaller pair for the
system drives, but if all the drives are the same we only need to stock
one type of spare.
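
Just so I'm sure I have the config part right, here's roughly what I had
in mind (the mount points are placeholders, and I'm assuming dfs.data.dir
is still the right property name on whatever release we end up running):

  <property>
    <name>dfs.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn,/data/5/dfs/dn,/data/6/dfs/dn</value>
  </property>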

Another question I have is whether 1TB drives would be advisable over 2TB
for the sake of reducing rebuild time, or whether I'm still thinking of
this as I would a RAID volume.  If we needed to rebalance across the
cluster, would the time required depend more on the amount of data
involved and the connectivity between the nodes?
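
(For what it's worth, I assume we'd kick that off with the stock balancer,
something along the lines of

  bin/hadoop balancer -threshold 10

where the threshold, if I understand it correctly, is how far each node's
utilization is allowed to deviate, in percent, from the cluster average.
Please correct me if I have that wrong.)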

-John



On 2/7/11 4:40 PM, "Scott Golby" <sgo...@conductor.com> wrote:

>
>> The big issue you will encounter is losing a disk - the DataNode
>> process will crash, and if you comment out the affected drive,
>> when you replace it you will have 9 disks full to N% and one empty
>> disk.
>> The DFS balancer cannot fix this - usually when I have data nodes down
>>more than an hour, I format all drives in the box and rebalance.
>
>Yeah this bites us when we add a disk, love getting monitors going off
>for "disk 90% full" when you've got the new disk at <10%.  We've tried a
>few tricks moving the reserved blocks up to force 'balance' it but it's
>pretty ineffective by and large.
>
>
>>> but if the loss of a single drive necessitated rebuilding an entire
>>>node, and therefore being down in capacity during that period,
>>> just doesn't seem to be the most efficient approach
>
>This bit about rebuilding the entire node isn't true, that's just
>Jonathan's choice to wipe the node & an interesting one it is (we might
>consider that for our small cluster).  Lose a disk & you lose just the
>capacity of that disk from the entire pool of space in the cluster.
>
>1 out of 3 copies of *some* of the HDFS blocks go away, not the entire
>node's blocks; usually this wouldn't be very much of a loss (typical 4
>disk boxes, x XYZ boxes = quite a few disks).  The 1 missing replica will
>likely be re-copied (I often say re-built, but that's RAID) before you
>put the new disk in, but say somehow you were 100% full, you'd add the
>new disk and the blocks which were in a 2 copies/replica state would copy
>themselves a 3rd time.  (the lack of inter-node disk balance is an issue
>again here)
>
>
>> We are building a new cluster aimed primarily at storage - we will be
>>using SuperMicro 4U machines
>> with 36 2TB SATA disks in three RAID6 volumes (for roughly 20TB usable
>>per volume, 60 total),
>
>I really like the SuperMicro cases for big disk boxes.  What are you
>using to run the 36 disks all at once ?
>
>Scott Golby
>
