The reliability question can be answered to first order by computing the replication time for a unit of storage and then computing how often that replication window will contain enough additional failures to cause data loss. Such data loss events should be roughly Poisson distributed, with rate equal to the rate of the original failures times the probability that any given failure actually leads to data loss. Second order effects appear when one replication spills into the next, lengthening the replication period for the second event. It is difficult, if not impossible, to account for all of the second order effects in closed form, and I have found it necessary to resort to discrete event simulation to estimate failure mode probabilities in detail. For small numbers of disks per node, one second order effect that becomes important is the node failure rate.
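To make the shape of that computation concrete, here is a rough Python sketch. It is purely illustrative: the drive count, MTBF, and 15 minute recovery window are the assumptions from the example below, and the "simulation" is deliberately crude (it does not model the window-stretching spillover).

import math
import random

FAILURE_RATE = 1.0 / 1000          # per-drive failures per day (1000 day MTBF)
N_DRIVES = 1000                    # e.g. 100 nodes x 10 drives
WINDOW_DAYS = 15.0 / (24 * 60)     # ~15 minute re-replication window

cluster_rate = FAILURE_RATE * N_DRIVES   # ~1 failure per day at this scale
lam = cluster_rate * WINDOW_DAYS         # expected extra failures per window

# First order: data loss needs at least 2 further failures inside the
# window (all 3 copies gone at 3x replication).  Poisson tail:
p_loss = 1.0 - math.exp(-lam) * (1.0 + lam)
print("per-incident loss probability:", p_loss)
print("mean days between loss events:", 1.0 / (cluster_rate * p_loss))

# Crude check: draw a Poisson process of failures and count how many are
# followed by at least 2 more inside their recovery window.
def simulate(days=1_000_000, rng=random.Random(42)):
    times, t = [], 0.0
    while t < days:
        t += rng.expovariate(cluster_rate)
        times.append(t)
    losses = sum(
        1
        for i, f in enumerate(times)
        if sum(1 for x in times[i + 1:i + 10] if x - f <= WINDOW_DAYS) >= 2
    )
    return losses / days

print("simulated loss events per day:", simulate())

A real discrete event simulation would also stretch the window when replications overlap, which is exactly the second order effect mentioned above.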
Grouping disks into storage groups, or failing an entire node when one disk fails, makes the effective storage unit larger than an individual disk. Use of a volume manager or RAID-0 will likewise increase the storage unit size. These failure modes drive some limitations on cluster size, since the absolute rate of storage unit failures increases with cluster size. For a fixed number of drives per storage unit, the limiting factor is the total number of disk drives, not the number of nodes.

In older versions of Hadoop, the storage unit was all of the drives on a node, which is quite dangerous in terms of mean time to data loss. More recently, a fix has been committed to trunk (and I think .204, Todd will correct me if I am wrong) that makes the storage unit equal to a single drive. In the old situation it was dangerous to have too many drives on each node in large clusters; with single disk storage units, the number of drives per machine does not matter in this computation.

To be specific, taking a 100 node x 10 disk x 2 TB configuration with a drive MTBF of 1000 days, we should see drive failures on average about once per day. With 1Gb Ethernet and 30 MB/s per node dedicated to re-replication (roughly 3 GB/s aggregate across the cluster), it will take just over 10 minutes to restore replication of a single drive and just over 100 minutes to restore replication of an entire machine. The probability of 2 disk failures during the 15 minutes after a failure is roughly \lambda^2 e^{-\lambda} / 2, where \lambda = (15 minutes)/(24 hours) \approx 0.01. This is a small probability, so average times between data loss events should be relatively long. For the larger storage unit of 10 disks, the probability is not so small, and data loss should be expected every few years or so.

For a 10,000 node cluster, however, we should expect disk failures on average about every 15 minutes (100,000 drives at a 1000 day MTBF is 100 failures per day). Here the number of disks is large enough that the first order computation is much less accurate, since the placement of disk blocks across the cluster will often have more non-uniformity due to small counts. This non-uniformity increases the replication recovery time. With the large storage unit model, the probability that three disk failures will stack up becomes unacceptably large. Even with the single disk storage unit, the data loss rate becomes large enough that the cluster cannot be considered archival.

The real question about optimal configuration depends on how fast the cluster can move data from disk. If this rate is relatively low compared to the hardware speeds, then supporting full performance from large numbers of drives is very difficult. If you can maintain high transfer rates, however, you can substantially decrease the cost of your cluster by having fewer nodes.
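For anyone who wants to replay the arithmetic above, a small sketch with the assumed parameters straight from the example (I use 1 TB = 10^6 MB to keep the numbers round):

import math

NODES = 100
DISKS_PER_NODE = 10
DISK_TB = 2.0
MTBF_DAYS = 1000.0
REREP_MB_S = 30.0                         # per-node re-replication bandwidth

aggregate_mb_s = NODES * REREP_MB_S       # ~3 GB/s cluster wide
drive_min = DISK_TB * 1e6 / aggregate_mb_s / 60
node_min = drive_min * DISKS_PER_NODE
print(f"one drive: {drive_min:.0f} min, one node: {node_min:.0f} min")

failures_per_day = NODES * DISKS_PER_NODE / MTBF_DAYS   # ~1 per day

def p_two_more(window_min):
    """P(exactly 2 further failures inside the window), Poisson."""
    lam = failures_per_day * window_min / (24 * 60)
    return lam ** 2 * math.exp(-lam) / 2

print("single-drive unit:", p_two_more(15))        # ~5e-5
print("whole-node unit:  ", p_two_more(node_min))  # ~50x worse

Scaling NODES up to 10,000 in the same sketch is what pushes the failure arrival rate to the point where the first order model starts to break down.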
On Wed, Aug 10, 2011 at 7:56 AM, Evert Lammerts <evert.lamme...@sara.nl> wrote:

> A short, slightly off-topic question:
>
> > Also note that in this configuration one cannot take advantage of the
> > "keep the machine up at all costs" features in newer Hadoop's, which
> > require that root, swap, and the log area be mirrored to be truly
> > effective. I'm not quite convinced that those features are worth it
> > yet for anything smaller than maybe a 12 disk config.
>
> Dell and Cloudera promote the C2100. I'd like to see the calculations
> behind that config. Am I wrong thinking that keeping your cluster up
> with such dense nodes will only work if you have many (order of
> magnitude 100+) of them, and interconnected with 10Gb Ethernet? If you
> don't, then recovery times from failing disks / rack switches are going
> to get crazy, right? If you want to get bang for buck, don't the
> proportions "disk IO / processing power", "node storage capacity /
> ethernet speed" and "total amount of nodes / ethernet speed" indicate
> many small nodes with not too many disks and 1Gb Ethernet?
>
> Cheers,
> Evert