Todd is correct. The capability to recognize repaired disks and re-incorporate them is not available in the current implementation of disk fail-in-place. So the datanode service does need to be restarted, at which point it will re-join the cluster automatically, with all its working disks.
On Fri, Nov 11, 2011 at 10:37 AM, Todd Lipcon <t...@cloudera.com> wrote:
> On Fri, Nov 11, 2011 at 10:15 AM, Matt Foley <mfo...@hortonworks.com> wrote:
> > Nope; hot swap :-)
>
> AFAIK you can't re-add the marked-dead disk to the DN, can you?
>
> But yea, you can hot-swap the disk, then kick the DN process, which
> should take less than 10 minutes. That means the NN won't ever notice
> it's down, and you won't incur any replication costs.
>
> -Todd
>
> > On Nov 11, 2011, at 9:59 AM, Steve Ed <sediso...@gmail.com> wrote:
> >
> > I understand that with 0.20.204, loss of a disk doesn't lose the node. But
> > if we have to replace that lost disk, it's again scheduling the whole node
> > down, kicking off replication.
> >
> > From: Matt Foley [mailto:mfo...@hortonworks.com]
> > Sent: Friday, November 11, 2011 1:58 AM
> > To: hdfs-user@hadoop.apache.org
> > Subject: Re: Sizing help
> >
> > I agree with Ted's argument that 3x replication is way better than 2x. But
> > I do have to point out that, since 0.20.204, the loss of a disk no longer
> > causes the loss of a whole node (thankfully!) unless it's the system disk.
> > So in the example given, if you estimate a disk failure every 2 hours, each
> > node only has to re-replicate about 2GB of data, not 12GB. So about 1-in-72
> > such failures risks data loss, rather than 1-in-12. Which is still
> > unacceptable, so use 3x replication! :-)
> >
> > --Matt
> >
> > On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> >
> > 3x replication has two effects. One is reliability. This is probably more
> > important in large clusters than small.
> >
> > Another important effect is data locality during map-reduce. Having 3x
> > replication allows mappers to have almost all invocations read from local
> > disk. 2x replication compromises this. Even where you don't have local
> > data, the bandwidth available to read from 3x replicated data is 1.5x the
> > bandwidth available for 2x replication.
> >
> > To get a rough feel for how reliable you should consider a cluster, you can
> > do a pretty simple computation. If you have 12 x 2T on a single machine and
> > you lose that machine, the remaining copies of that data must be replicated
> > before another disk fails. With HDFS and block-level replication, the
> > remaining copies will be spread across the entire cluster, so any disk
> > failure is reasonably likely to cause data loss. For a 1000 node cluster
> > with 12000 disks, it is conservative to estimate a disk failure on average
> > every 2 hours. Each node will have to replicate about 12GB of data, which
> > will take about 500 seconds, or about 9 or 10 minutes, if you only use 25%
> > of your network for re-replication. The probability of a disk failure
> > during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly
> > 1 in 12 full machine failures might cause data loss. You can pick whatever
> > you like for the rate at which nodes die, but I don't think that this is
> > acceptable.
> >
> > My numbers for disk failures are purposely somewhat pessimistic. If you
> > change the MTBF for disks to 10 years instead of 3 years, then the
> > probability of data loss after a machine failure drops, but only to about
> > 2.5%.
> >
> > Now, I would be the first to say that these numbers feel too high, but I
> > also would rather not experience enough data loss events to have a reliable
> > gut feel for how often they should occur.
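For anyone who wants to poke at Ted's arithmetic, here is a small back-of-the-envelope sketch in Python. The constants are just the assumptions from the paragraph above (1000 nodes, 12 drives per node, a 3-year disk MTBF, ~12 GB of re-replication work per surviving node, 25% of a ~1 GbE link), none of them measured values, and the exponential-failure model is the same simplification Ted uses.

    import math

    # Assumptions lifted from Ted's example above -- illustrative, not measured.
    nodes = 1000
    disks_per_node = 12
    disk_mtbf_years = 3                  # switch to 10 for the optimistic case (~2.3%)
    rereplicate_gb_per_node = 12.0       # data each surviving node must copy
    usable_mb_per_s = 0.25 * 125         # 25% of a ~1 GbE link, in MB/s

    disks = nodes * disks_per_node
    # Expected time between disk failures somewhere in the cluster (exponential model).
    minutes_between_failures = disk_mtbf_years * 365 * 24 * 60 / disks        # ~130 min

    # Raw transfer time for the re-replication work on each surviving node.
    transfer_min = rereplicate_gb_per_node * 1024 / usable_mb_per_s / 60      # ~6.5 min

    # Ted rounds the exposure window up to 10 minutes before taking the probability
    # that some other disk fails inside it: 1 - exp(-t / T).
    window_min = 10.0
    p_loss = 1 - math.exp(-window_min / minutes_between_failures)             # ~7-8%

    print(f"a disk fails every ~{minutes_between_failures:.0f} min; "
          f"re-replication takes ~{transfer_min:.0f} min; "
          f"P(data loss per dead machine) ~ {p_loss:.1%}")

Shrinking rereplicate_gb_per_node to the ~2 GB Matt mentions for a single-disk failure (and the exposure window along with it) lands in the neighborhood of his 1-in-72 figure.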
> > My feeling is that 2x is fine for data you can reconstruct and which you
> > don't need to read really fast, but not good enough for data whose loss
> > will get you fired.
> >
> > On Mon, Nov 7, 2011 at 7:34 PM, Rita <rmorgan...@gmail.com> wrote:
> >
> > I have been running with 2x replication on a 500tb cluster. No issues
> > whatsoever. 3x is for the super paranoid.
> >
> > On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> >
> > Depending on which distribution and what your data center power limits are,
> > you may save a lot of money by going with machines that have 12 x 2 or 3 TB
> > drives. With suitable engineering margins and 3x replication you can have
> > 5 TB net data per node and 20 nodes per rack. If you want to go all cowboy
> > with 2x replication and little space to spare then you can double that
> > density.
> >
> > On Monday, November 7, 2011, Rita <rmorgan...@gmail.com> wrote:
> >> For a 1PB installation you would need close to 170 servers with 12 TB disk
> >> packs installed on them (with replication factor of 2). That's a
> >> conservative estimate.
> >> CPUs: 4 cores with 16gb of memory
> >>
> >> Namenode: 4 cores with 32gb of memory should be ok.
> >>
> >> On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <sediso...@gmail.com> wrote:
> >>>
> >>> I am a newbie to Hadoop and trying to understand how to size a Hadoop
> >>> cluster.
> >>>
> >>> What factors should I consider in deciding the number of datanodes?
> >>>
> >>> Datanode configuration? CPU, memory
> >>>
> >>> Amount of memory required for the namenode?
> >>>
> >>> My client is looking at 1 PB of usable data and will be running
> >>> analytics on TB size files using mapreduce.
> >>>
> >>> Thanks
> >>> ….. Steve
> >>
> >> --
> >> --- Get your facts first, then you can distort them as you please.--
> >
> > --
> > --- Get your facts first, then you can distort them as you please.--
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
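For Steve's original 1 PB sizing question, the same kind of back-of-the-envelope arithmetic is easy to write down. This is only a sketch of the two rules of thumb quoted above (Rita's "close to 170 servers" at 2x with 12 TB per box, and Ted's ~5 TB net per node at 3x); the headroom fraction is an assumed engineering margin, not a number anyone in the thread measured.

    import math

    def datanodes_needed(usable_tb, replication, raw_tb_per_node, headroom=0.0):
        """Nodes required to hold `usable_tb` of user data at the given replication
        factor, with a `headroom` fraction of each node's raw disk held back for
        temp space, MapReduce spill, and general engineering margin."""
        net_tb_per_node = raw_tb_per_node * (1.0 - headroom) / replication
        return math.ceil(usable_tb / net_tb_per_node)

    # Rita's estimate: 12 TB raw per server, 2x replication, little headroom
    # -> about 167 nodes, i.e. "close to 170 servers".
    print(datanodes_needed(1000, 2, 12))

    # Ted's rule of thumb: 12 x 2 TB drives with a healthy engineering margin and
    # 3x replication -> ~5 TB net per node, so about 200 nodes (10 racks at 20/rack).
    print(datanodes_needed(1000, 3, 24, headroom=0.375))

That puts a concrete hardware delta on the 2x-versus-3x trade-off discussed above.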