I agree with Ted's argument that 3x replication is way better than 2x. But I do have to point out that, since 0.20.204, the loss of a disk no longer causes the loss of a whole node (thankfully!) unless it's the system disk. So in the example given, still assuming a disk failure every 2 hours, a single-disk failure means each node only has to re-replicate about 2GB of data, not 12GB. That puts the risk of data loss at about 1 in 72 such failures, rather than 1 in 12. Which is still unacceptable, so use 3x replication! :-) --Matt
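For anyone who wants to plug in their own failure rates, the model Ted and Matt are both using boils down to one line: assume disk failures arrive exponentially (one every ~2 hours across 12,000 disks) and ask how likely another failure is during the re-replication window. A minimal sketch in Python, using the windows from the posts below (~10 minutes for a whole node, about a sixth of that for a single disk); the function name is just for illustration:

import math

def p_second_failure(rereplication_minutes, mean_minutes_between_failures=120.0):
    # Chance that another disk somewhere in the cluster fails before
    # re-replication finishes, assuming exponentially distributed failure
    # arrivals (Ted's "disk failure every ~2 hours" across 12,000 disks).
    return 1.0 - math.exp(-rereplication_minutes / mean_minutes_between_failures)

# Whole-node loss: ~12GB to re-replicate per surviving node, call it ~10 minutes.
print(p_second_failure(10.0))      # ~0.080 -> roughly 1 in 12
# Single-disk loss (0.20.204+): ~2GB per node, about a sixth of the window.
print(p_second_failure(10.0 / 6))  # ~0.014 -> roughly 1 in 72

Stretching the disk MTBF from 3 years to 10 years stretches the mean gap between failures from ~120 to ~400 minutes, which is where Ted's ~2.5% figure for whole-machine loss comes from.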
On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> 3x replication has two effects. One is reliability. This is probably
> more important in large clusters than small.
>
> Another important effect is data locality during map-reduce. Having 3x
> replication allows mappers to have almost all invocations read from local
> disk. 2x replication compromises this. Even where you don't have local
> data, the bandwidth available to read from 3x replicated data is 1.5x the
> bandwidth available for 2x replication.
>
> To get a rough feel for how reliable you should consider a cluster, you
> can do a pretty simple computation. If you have 12 x 2T on a single
> machine and you lose that machine, the remaining copies of that data must
> be replicated before another disk fails. With HDFS and block-level
> replication, the remaining copies will be spread across the entire
> cluster, so any disk failure is reasonably likely to cause data loss. For
> a 1000 node cluster with 12000 disks, it is conservative to estimate a
> disk failure on average every 2 hours. Each node will have to replicate
> about 12GB of data, which will take about 500 seconds, or about 9 or 10
> minutes, if you only use 25% of your network for re-replication. The
> probability of a disk failure during a 10 minute period is
> 1-exp(-10/120) = 8%. This means that roughly 1 in 12 full machine
> failures might cause data loss. You can pick whatever you like for the
> rate at which nodes die, but I don't think that this is acceptable.
>
> My numbers for disk failures are purposely somewhat pessimistic. If you
> change the MTBF for disks to 10 years instead of 3 years, then the
> probability of data loss after a machine failure drops, but only to about
> 2.5%.
>
> Now, I would be the first to say that these numbers feel too high, but I
> also would rather not experience enough data loss events to have a
> reliable gut feel for how often they should occur.
>
> My feeling is that 2x is fine for data you can reconstruct and which you
> don't need to read really fast, but not good enough for data whose loss
> will get you fired.
>
> On Mon, Nov 7, 2011 at 7:34 PM, Rita <rmorgan...@gmail.com> wrote:
>
>> I have been running with 2x replication on a 500tb cluster. No issues
>> whatsoever. 3x is for the super paranoid.
>>
>> On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
>>
>>> Depending on which distribution and what your data center power limits
>>> are, you may save a lot of money by going with machines that have 12 x
>>> 2 or 3 tb drives. With suitable engineering margins and 3x replication
>>> you can have 5 tb net data per node and 20 nodes per rack. If you want
>>> to go all cowboy with 2x replication and little space to spare then you
>>> can double that density.
>>>
>>> On Monday, November 7, 2011, Rita <rmorgan...@gmail.com> wrote:
>>> > For a 1PB installation you would need close to 170 servers with 12 TB
>>> > disk packs installed on them (with a replication factor of 2). That's
>>> > a conservative estimate.
>>> > CPUs: 4 cores with 16gb of memory
>>> >
>>> > Namenode: 4 cores with 32gb of memory should be ok.
>>> >
>>> > On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <sediso...@gmail.com> wrote:
>>> >>
>>> >> I am a newbie to Hadoop and trying to understand how to size a
>>> >> Hadoop cluster.
>>> >>
>>> >> What factors should I consider in deciding the number of datanodes?
>>> >>
>>> >> Datanode configuration? CPU, memory?
>>> >>
>>> >> Amount of memory required for the namenode?
>>> >>
>>> >> My client is looking at 1 PB of usable data and will be running
>>> >> analytics on TB-size files using mapreduce.
>>> >>
>>> >> Thanks
>>> >> ….. Steve
>>> >
>>> > --
>>> > --- Get your facts first, then you can distort them as you please.--
>>
>> --
>> --- Get your facts first, then you can distort them as you please.--
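Ted's sizing rule of thumb in the quoted thread (12 x 2TB drives, 3x replication, ~5TB net per node, 20 nodes per rack) works out the same back-of-the-envelope way. A rough sketch, where the 0.625 usable fraction is an assumed engineering margin (OS, temp/shuffle space, free headroom) picked so the result lands near his 5TB figure, not a number stated in the thread:

DRIVES_PER_NODE = 12
TB_PER_DRIVE = 2.0        # Ted suggests 2 or 3 TB drives
REPLICATION = 3           # 2x with a thinner margin is Ted's "cowboy" option
USABLE_FRACTION = 0.625   # assumed margin for OS, temp/shuffle space, headroom

raw_per_node = DRIVES_PER_NODE * TB_PER_DRIVE                # 24 TB raw
net_per_node = raw_per_node * USABLE_FRACTION / REPLICATION  # ~5 TB net
nodes_for_1pb = 1000.0 / net_per_node                        # ~200 nodes
racks = nodes_for_1pb / 20                                   # ~10 racks at 20 nodes/rack

print(net_per_node, nodes_for_1pb, racks)

This also squares roughly with Rita's estimate of close to 170 servers for 1PB at 2x: 2PB of raw replicas spread over ~12TB of disk per node comes out to about 170 nodes before any margin.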