On Thu, Jun 30, 2011 at 5:16 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> You have to consider the long-term reliability as well.
>
> Losing an entire set of 10 or 12 disks at once makes the overall
> reliability of a large cluster very suspect. This is because it becomes
> entirely too likely that two additional drives will fail before the data
> on the off-line node can be replicated. For 100 nodes, that can decrease
> the average time to data loss down to less than a year. This can only be
> mitigated in stock hadoop by keeping the number of drives relatively low.
> MapR avoids this by not failing nodes for trivial problems.

I'd advise you to look at "stock hadoop" again. This used to be true, but
was fixed a long while back by HDFS-457 and several follow-up JIRAs.

If MapR does something fancier, I'm sure we'd be interested to hear about
it so we can compare the approaches.

-Todd

> On Thu, Jun 30, 2011 at 4:18 PM, Aaron Eng <a...@maprtech.com> wrote:
> >
> > > Keeping the number of disks per node low and the number of nodes
> > > high should keep the impact of dead nodes in control.
> >
> > It keeps the impact of dead nodes in control, but I don't think that's
> > cost-efficient in the long term. As prices of 10GbE go down, the "keep
> > the node small" argument seems less fitting. And on another note, most
> > servers manufactured in the last 10 years have dual 1GbE network
> > interfaces. If one were to go by these calcs:
> >
> > > 150 nodes with four 2TB disks each, with HDFS 60% full, it takes
> > > around ~32 minutes to recover
> >
> > It seems like that assumes a single 1GbE interface; why not leverage
> > the second?
> >
> > On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts
> > <evert.lamme...@sara.nl> wrote:
> >
> > > > You can get 12-24 TB in a server today, which means the loss of a
> > > > server generates a lot of traffic - which argues for 10GbE.
> > > >
> > > > But:
> > > > - big increase in switch cost, especially if you (CoI warning) go
> > > >   with Cisco
> > > > - there have been problems with things like BIOS PXE and lights-out
> > > >   management on 10GbE - probably due to the NICs being things the
> > > >   BIOS wasn't expecting and off the mainboard. This should improve.
> > > > - I don't know how well Linux works with ether that fast (field
> > > >   reports useful)
> > > > - the big threat is still ToR switch failure, as that will trigger
> > > >   a re-replication of every block in the rack.
> > >
> > > Keeping the number of disks per node low and the number of nodes
> > > high should keep the impact of dead nodes in control. A ToR switch
> > > failing is different - missing 30 nodes (~120TB) at once cannot be
> > > fixed by adding more nodes; adding nodes actually increases the
> > > chance of a ToR switch failure. Although such failure is quite rare
> > > to begin with, I guess. The back-of-the-envelope calculation I made
> > > suggests that ~150 (1U) nodes should be fine with 1GbE. (E.g., when
> > > 6 nodes fail in a cluster of 150 nodes with four 2TB disks each,
> > > with HDFS 60% full, it takes around ~32 minutes to recover; 2 nodes
> > > failing should take around 640 seconds. Also see the attached
> > > spreadsheet.) This doesn't take ToR switch failure into account,
> > > though. On the other hand, 150 nodes is only ~5 racks; in such a
> > > scenario you might rather shut the system down completely than let
> > > it replicate 20% of all data.
> > >
> > > Cheers,
> > > Evert

--
Todd Lipcon
Software Engineer, Cloudera
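
For anyone who wants to redo Evert's numbers without the spreadsheet (which
is not attached here), below is a minimal sketch of the same
back-of-the-envelope calculation. The ~100 MB/s effective per-node
throughput (roughly 1GbE after protocol overhead) and the assumption that
re-replication traffic spreads evenly over the surviving nodes are guesses
that happen to land near the quoted figures; they are not values taken from
the spreadsheet itself.

    # Sketch of the re-replication-time estimate discussed above.
    # Assumptions (not from the unattached spreadsheet): traffic spreads
    # evenly over surviving nodes, and each node sustains ~100 MB/s of
    # effective re-replication throughput (1GbE minus overhead).

    TB = 1e12  # decimal terabytes, matching disk-vendor sizing

    def recovery_time_s(total_nodes=150, disks_per_node=4, disk_size_tb=2.0,
                        hdfs_fill=0.6, failed_nodes=6,
                        node_throughput_bytes_s=100e6):
        """Estimated seconds to re-replicate the data the failed nodes held."""
        data_per_node = disks_per_node * disk_size_tb * TB * hdfs_fill
        lost_data = failed_nodes * data_per_node      # bytes to rewrite
        survivors = total_nodes - failed_nodes        # nodes sharing the load
        per_node_write = lost_data / survivors
        return per_node_write / node_throughput_bytes_s

    print(recovery_time_s(failed_nodes=6) / 60)  # ~33 min (quoted: ~32 min)
    print(recovery_time_s(failed_nodes=2))       # ~649 s  (quoted: ~640 s)

Doubling the throughput figure (Aaron's second 1GbE interface, or bonded
NICs) halves these times. A ToR failure is just failed_nodes=30 in the same
formula: roughly 144 TB to rewrite under these assumptions, or about 3.3
hours, ignoring cross-rack bandwidth limits.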