Hi Bobby,

We keep 2 or so replicas here at Nebraska. We have about 800TB of raw space.
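As a rough illustration of what those two figures imply for usable space, here is a minimal sketch. It assumes every block is stored at one uniform replication factor and that no space is reserved for anything other than HDFS, neither of which is exactly true of a real cluster. The function name `usable_tb` is made up for this example.

```python
# Rough usable-capacity estimate for an HDFS cluster.
# Assumption: every block is stored at the same replication factor and
# no space is reserved for non-HDFS use; real clusters will differ.

def usable_tb(raw_tb: float, replication: int) -> float:
    """Return the approximate logical capacity in TB."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return raw_tb / replication


RAW_TB = 800  # approximate raw space quoted above

for factor in (2, 3):
    print(f"replication {factor}: ~{usable_tb(RAW_TB, factor):.0f} TB usable")
```

Under these assumptions, 800TB of raw space holds roughly 400TB of data at replication 2 and roughly 267TB at replication 3.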
As a rule of thumb, we:

1) Increase the replication of extremely important files. We are a site for the LHC, so a large part of our data is stored on tape elsewhere (but not everything!). It's an operational pain to re-download a few tens of TB, but not the end of the world.
2) Estimate that we will lose 1 file per month due to disk corruption and loss.
3) Make sure our management understands the risks due to (2) and what would occur during a double-node failure.

There are 2 other similarly sized sites that roughly follow the same rules. We haven't discovered any fatal software bugs that cause data loss since the various ones in 0.19 were ironed out.

Brian

On Jul 21, 2010, at 8:29 PM, Bobby Dennett wrote:

> The team that manages our Hadoop clusters is currently being pressured
> to reduce block replication from 3 to 2 in our production cluster. This
> request is for various reasons -- particularly the reduction of used
> space in the cluster and potential of reduced write operations -- but
> from what I've read previously, it seems to be strongly discouraged.
>
> Of course I can't find it now, but I recall seeing a post that Doug
> Cutting was involved with stating that having replication 3 is something
> like 100 times "safer" than replication 2. If I remember correctly,
> there was mention of potential NameNode bugs that could introduce
> undetected corrupted/missing replicas so the idea was that if more
> replicas are created, the chance of this type of bug is much less. On a
> related note, it seems that the companies using a reduced replication
> factor (e.g. Facebook) have also built an application layer on top of
> Hadoop to perform exception handling, corruption issues, etc.
> Unfortunately, we do not currently have the resources to do something
> similar.
>
> For anyone currently using a replication of 2 in production, can you
> please share your experience and any issues you may have encountered?
> Also, I would appreciate any thoughts about whether a replication factor
> of 2 can be considered "safe".
>
> Thanks in advance,
> -Bobby