Another factor to consider, when disk is bad you may have corrupted blocks 
which may only get detected by the periodic DataBlockScanner check.
I believe each datanode tries to finish the entire scan in 
dfs.datanode.scan.period.hours (3weeks default) period.
So with 2x replication and some undetected bad disk(s), you can have blocks 
with effective replication of 1 which would lead to missing blocks eventually.

Koji


On 11/11/11 1:57 AM, "Matt Foley" <mfo...@hortonworks.com> wrote:

I agree with Ted's argument that 3x replication is way better than 2x.  But I 
do have to point out that, since 0.20.204, the loss of a disk no longer causes 
the loss of a whole node (thankfully!) unless it's the system disk.  So in the 
example given, if you estimate a disk failure every 2 hours, each node only has 
to re-replicate about 2GB of data, not 12GB.  So about 1-in-72 such failures 
risks data loss, rather than 1-in-12.  Which is still unacceptable, so use 3x 
replication! :-)
--Matt

On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
3x replication has two effects.  One is reliability.  This is probably more 
important in large clusters than small.

Another important effect is data locality during map-reduce.  Having 3x 
replication allows mappers to have almost all invocations read from local disk. 
 2x replication compromises this.  Even where you don't have local data, the 
bandwidth available to read from 3x replicated data is 1.5x the bandwidth 
available for 2x replication.

To get a rough feel for how reliable you should consider a cluster, you can do 
a pretty simple computation.  If you have 12 x 2T on a single machine and you 
lose that machine, the remaining copies of that data must be replicated before 
another disk fails.  With HDFS and block-level replication, the remaining 
copies will be spread across the entire cluster to any disk failure is 
reasonably like to cause data loss.  For a 1000 node cluster with 12000 disks, 
it is conservative to estimate a disk failure on average every 2 hours.  Each 
node will have replicate about 12GB of data which will take about 500 seconds 
or about 9 or 10 minutes if you only use 25% of your network for 
re-replication.  The probability of a disk failure  during a 10 minute period 
is 1-exp(-10/120) = 8%.  This means that roughly 1 in 12 full machine failures 
might cause data loss.   You can pick whatever you like for the rate at which 
nodes die, but I don't think that this is acceptable.

My numbers for disk failures are purposely somewhat pessimistic.  If you change 
the MTBF for disks to 10 years instead of 3 years, then the probability of data 
loss after a machine failure drops, but only to about 2.5%.

Now, I would be the first to say that these numbers feel too high, but I also 
would rather not experience enough data loss events to have a reliable gut feel 
for how often they should occur.

My feeling is that 2x is fine for data you can reconstruct and which you don't 
need to read really fast, but not good enough for data whose loss will get you 
fired.

On Mon, Nov 7, 2011 at 7:34 PM, Rita <rmorgan...@gmail.com> wrote:
I have been running with 2x replication on a 500tb cluster. No issues 
whatsoever. 3x is for super paranoid.


On Mon, Nov 7, 2011 at 5:06 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
Depending on which distribution and what your data center power limits are you 
may save a lot of money by going with machines that have 12 x 2 or 3 tb drives. 
 With suitable engineering margins and 3 x replication you can have 5 tb net 
data per node and 20 nodes per rack.  If you want to go all cowboy with 2x 
replication and little space to spare then you can double that density.

On Monday, November 7, 2011, Rita <rmorgan...@gmail.com> wrote:
> For a 1PB installation you would need close to 170 servers with 12 TB disk 
> pack installed on them (with replication factor of 2). Thats a conservative 
> estimate
> CPUs: 4 cores with 16gb of memory
>
> Namenode: 4 core with 32gb of memory should be ok.
>
>
> On Fri, Oct 21, 2011 at 5:40 PM, Steve Ed <sediso...@gmail.com> wrote:
>>
>> I am a newbie to Hadoop and trying to understand how to Size a Hadoop 
>> cluster.
>>
>>
>>
>> What are factors I should consider deciding the number of datanodes ?
>>
>> Datanode configuration ?  CPU, Memory
>>
>> Amount of memory required for namenode ?
>>
>>
>>
>> My client is looking at 1 PB of  usable data and will be running analytics 
>> on TB size files using mapreduce.
>>
>>
>>
>>
>>
>> Thanks
>>
>> ….. Steve
>>
>>
>
>
> --
> --- Get your facts first, then you can distort them as you please.--
>


Reply via email to