Todd is correct. The capability to recognize repaired disks and re-incorporate them is not available in the current implementation of disk fail-in-place. So the datanode service does need to be restarted, at which point it will re-join the cluster automatically, with all its working disks.
On Fri, Nov 11, 2011 at 10:37 AM, Todd Lipcon <t...@cloudera.com> wrote:
> On Fri, Nov 11, 2011 at 10:15 AM, Matt Foley <mfo...@hortonworks.com> wrote:
> > Nope; hot swap :-)
>
> AFAIK you can't re-add the marked-dead disk to the DN, can you?
>
> But yea, you can hot-swap the disk, then kick the DN process, which
> should take less than 10 minutes. That means the NN won't ever notice
> it's down, and you won't incur any replication costs.
>
> -Todd
>
> > On Nov 11, 2011, at 9:59 AM, Steve Ed <sediso...@gmail.com> wrote:
> >
> > I understand that with 0.20.204, loss of a disk doesn't lose the node. But
> > if we have to replace that lost disk, it's again scheduling the whole node
> > down, kicking off replication.
> >
> > From: Matt Foley [mailto:mfo...@hortonworks.com]
> > Sent: Friday, November 11, 2011 1:58 AM
> > To: hdfs-user@hadoop.apache.org
> > Subject: Re: Sizing help
> >
> > I agree with Ted's argument that 3x replication is way better than 2x. But
> > I do have to point out that, since 0.20.204, the loss of a disk no longer
> > causes the loss of a whole node (thankfully!) unless it's the system disk.
> > So in the example given, if you estimate a disk failure every 2 hours, each
> > node only has to re-replicate about 2GB of data, not 12GB. So about 1-in-72
> > such failures risks data loss, rather than 1-in-12. Which is still
> > unacceptable, so use 3x replication! :-)
> >
> > --Matt
> >
> > On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> >
> > 3x replication has two effects. One is reliability. This is probably more
> > important in large clusters than small.
> >
> > Another important effect is data locality during map-reduce. Having 3x
> > replication allows mappers to have almost all invocations read from local
> > disk. 2x replication compromises this. Even where you don't have local
> > data, the bandwidth available to read from 3x replicated data is 1.5x the
> > bandwidth available for 2x replication.
> >
> > To get a rough feel for how reliable you should consider a cluster, you can
> > do a pretty simple computation. If you have 12 x 2T on a single machine and
> > you lose that machine, the remaining copies of that data must be replicated
> > before another disk fails. With HDFS and block-level replication, the
> > remaining copies will be spread across the entire cluster, so any disk
> > failure is reasonably likely to cause data loss. For a 1000 node cluster
> > with 12000 disks, it is conservative to estimate a disk failure on average
> > every 2 hours. Each node will have to replicate about 12GB of data, which
> > will take about 500 seconds, or about 9 or 10 minutes, if you only use 25%
> > of your network for re-replication. The probability of a disk failure
> > during a 10 minute period is 1-exp(-10/120) = 8%. This means that roughly
> > 1 in 12 full machine failures might cause data loss. You can pick whatever
> > you like for the rate at which nodes die, but I don't think that this is
> > acceptable.
> >
> > My numbers for disk failures are purposely somewhat pessimistic. If you
> > change the MTBF for disks to 10 years instead of 3 years, then the
> > probability of data loss after a machine failure drops, but only to about
> > 2.5%.
> >
> > Now, I would be the first to say that these numbers feel too high, but I
> > also would rather not experience enough data loss events to have a reliable
> > gut feel for how often they should occur.
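For anyone who wants to poke at Ted's arithmetic, here is a small back-of-the-envelope sketch in Python. The constants are just the assumptions from the paragraph above (1000 nodes, 12 drives per node, a 3-year disk MTBF, ~12 GB of re-replication work per surviving node, 25% of a ~1 GbE link), none of them measured values, and the exponential-failure model is the same simplification Ted uses.

    import math

    # Assumptions lifted from Ted's example above -- illustrative, not measured.
    nodes = 1000
    disks_per_node = 12
    disk_mtbf_years = 3                  # switch to 10 for the optimistic case (~2.3%)
    rereplicate_gb_per_node = 12.0       # data each surviving node must copy
    usable_mb_per_s = 0.25 * 125         # 25% of a ~1 GbE link, in MB/s

    disks = nodes * disks_per_node
    # Expected time between disk failures somewhere in the cluster (exponential model).
    minutes_between_failures = disk_mtbf_years * 365 * 24 * 60 / disks        # ~130 min

    # Raw transfer time for the re-replication work on each surviving node.
    transfer_min = rereplicate_gb_per_node * 1024 / usable_mb_per_s / 60      # ~6.5 min

    # Ted rounds the exposure window up to 10 minutes before taking the probability
    # that some other disk fails inside it: 1 - exp(-t / T).
    window_min = 10.0
    p_loss = 1 - math.exp(-window_min / minutes_between_failures)             # ~7-8%

    print(f"a disk fails every ~{minutes_between_failures:.0f} min; "
          f"re-replication takes ~{transfer_min:.0f} min; "
          f"P(data loss per dead machine) ~ {p_loss:.1%}")

Shrinking rereplicate_gb_per_node to the ~2 GB Matt mentions for a single-disk failure (and the exposure window along with it) lands in the neighborhood of his 1-in-72 figure.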
> > My feeling is that 2x is fine for data you can reconstruct and which you
> > don't need to read really fast, but not good enough for data whose loss
> > will get you fired.
> >
> > On Mon, Nov 7, 2011 at 7:34 PM, Rita <rmorgan...@gmail.com> wrote:
> >
> > I have been running with 2x replication on a 500tb cluster. No issues
> > whatsoever. 3x is for the super paranoid.
> >
> > On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> >
> > Depending on which distribution and what your data center power limits are,
> > you may save a lot of money by going with machines that have 12 x 2 or 3 TB
> > drives. With suitable engineering margins and 3x replication you can have
> > 5 TB net data per node and 20 nodes per rack. If you want to go all cowboy
> > with 2x replication and little space to spare then you can double that
> > density.
> >
> > On Monday, November 7, 2011, Rita <rmorgan...@gmail.com> wrote:
> >> For a 1PB installation you would need close to 170 servers with 12 TB disk
> >> packs installed on them (with replication factor of 2). That's a
> >> conservative estimate.
> >> CPUs: 4 cores with 16gb of memory
> >>
> >> Namenode: 4 cores with 32gb of memory should be ok.
> >>
> >> On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <sediso...@gmail.com> wrote:
> >>>
> >>> I am a newbie to Hadoop and trying to understand how to size a Hadoop
> >>> cluster.
> >>>
> >>> What factors should I consider in deciding the number of datanodes?
> >>>
> >>> Datanode configuration? CPU, memory
> >>>
> >>> Amount of memory required for the namenode?
> >>>
> >>> My client is looking at 1 PB of usable data and will be running
> >>> analytics on TB size files using mapreduce.
> >>>
> >>> Thanks
> >>> ….. Steve
> >>
> >> --
> >> --- Get your facts first, then you can distort them as you please.--
> >
> > --
> > --- Get your facts first, then you can distort them as you please.--
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
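For Steve's original 1 PB sizing question, the same kind of back-of-the-envelope arithmetic is easy to write down. This is only a sketch of the two rules of thumb quoted above (Rita's "close to 170 servers" at 2x with 12 TB per box, and Ted's ~5 TB net per node at 3x); the headroom fraction is an assumed engineering margin, not a number anyone in the thread measured.

    import math

    def datanodes_needed(usable_tb, replication, raw_tb_per_node, headroom=0.0):
        """Nodes required to hold `usable_tb` of user data at the given replication
        factor, with a `headroom` fraction of each node's raw disk held back for
        temp space, MapReduce spill, and general engineering margin."""
        net_tb_per_node = raw_tb_per_node * (1.0 - headroom) / replication
        return math.ceil(usable_tb / net_tb_per_node)

    # Rita's estimate: 12 TB raw per server, 2x replication, little headroom
    # -> about 167 nodes, i.e. "close to 170 servers".
    print(datanodes_needed(1000, 2, 12))

    # Ted's rule of thumb: 12 x 2 TB drives with a healthy engineering margin and
    # 3x replication -> ~5 TB net per node, so about 200 nodes (10 racks at 20/rack).
    print(datanodes_needed(1000, 3, 24, headroom=0.375))

That puts a concrete hardware delta on the 2x-versus-3x trade-off discussed above.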