We run several clusters of thousands of nodes (as do many companies); our
largest one has over 10K nodes. Disks, machines, memory, and network fail
all the time. The larger the scale, the higher the odds that some machine
is bad on a given day. On the other hand, scale helps. If a single node out
of 10K fails, 9,999 others participate in re-distributing state. Even a
rack failure isn't a big deal most of the time (plus, a rack typically fails
due to a TOR (top-of-rack) switch issue, so the data is offline but usually
not lost permanently).
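
To put rough numbers on "the larger the scale, the higher the odds", here is
a minimal sketch in Python. It assumes every node fails independently with
the same daily probability; the 0.1% figure is a made-up illustration, not a
measurement from our clusters, and the function name is hypothetical.

```python
def prob_any_failure(nodes: int, p_daily: float) -> float:
    """Chance that at least one of `nodes` machines fails on a given day.

    Assumes independent failures with identical daily probability p_daily.
    """
    if nodes < 0 or not 0.0 <= p_daily <= 1.0:
        raise ValueError("nodes must be >= 0 and p_daily within [0, 1]")
    # P(at least one failure) = 1 - P(no node fails)
    return 1.0 - (1.0 - p_daily) ** nodes


P_DAILY = 0.001  # assumed per-node daily failure probability (illustrative)

for n in (5, 1_000, 10_000):
    print(f"{n:>6} nodes: {prob_any_failure(n, P_DAILY):.2%} chance of a failure today")
```

Real failures are not fully independent (see the coordinated failures
discussed further down), so treat this as a lower-bound intuition only.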

Hadoop is designed to deal with this, and by and large it does. Critical
components (such as Namenodes) can be configured to run in an HA pair with
automatic failover. Many in the Hadoop community are doing quite a bit of
work to keep pushing the boundaries of scale.

A node or a rack failing in a large cluster actually has less impact than
at smaller scale. With a 5-node cluster, if 1 machine crashes you've taken
20% of capacity (disk and compute) offline. 1 out of 1K barely registers.
Ditto with a 3-rack cluster: lose a rack and a third of your capacity is
offline.
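
The arithmetic is simple enough to sketch in a few lines of Python. This
assumes all nodes (or racks) contribute equal capacity, which is a
simplification; the helper name is hypothetical.

```python
def capacity_offline(failed: int, total: int) -> float:
    """Fraction of capacity lost when `failed` of `total` equal units go down."""
    if total <= 0 or not 0 <= failed <= total:
        raise ValueError("need total > 0 and 0 <= failed <= total")
    return failed / total


# Scenarios from the paragraph above, plus a 10K-node cluster for comparison.
scenarios = [
    ("1 node of 5", 1, 5),
    ("1 node of 1K", 1, 1_000),
    ("1 node of 10K", 1, 10_000),
    ("1 rack of 3", 1, 3),
]

for label, failed, total in scenarios:
    print(f"{label:>14}: {capacity_offline(failed, total):.2%} of capacity offline")
```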

It is large-scale coordinated failure you should worry about. Think several
rows of racks coming offline due to a power failure, a DC going offline due
to a fire in the building, etc. Those are hard to deal with in software
within a single DC. They should also be rarer, but as many companies have
experienced, large-scale coordinated failures do occasionally happen.

As to your question in the other email thread, it is a well-established
pattern that scaling horizontally with commodity hardware (and letting
software such as Hadoop deal with failures) helps with both scale and
reducing cost.

Cheers,

Joep


On Fri, May 27, 2016 at 11:02 AM, Arun Natva <arun.na...@gmail.com> wrote:

> Deepak,
> I have managed clusters where worker nodes crashed and disks failed.
> HDFS takes care of the data replication unless you lose so many of the
> nodes that there is not enough space left to fit the replicas.
>
>
>
> Sent from my iPhone
>
> On May 27, 2016, at 11:54 AM, Deepak Goel <deic...@gmail.com> wrote:
>
>
> Hey
>
> Namaskara~Nalama~Guten Tag~Bonjour
>
> We have yet to see any server go down among our cluster nodes in the
> production environment. Has anyone seen reliability problems in their
> production environment? How many times?
>
> Thanks
> Deepak
>    --
> Keigu
>
> Deepak
> 73500 12833
> www.simtree.net, dee...@simtree.net
> deic...@gmail.com
>
> LinkedIn: www.linkedin.com/in/deicool
> Skype: thumsupdeicool
> Google talk: deicool
> Blog: http://loveandfearless.wordpress.com
> Facebook: http://www.facebook.com/deicool
>
> "Contribute to the world, environment and more :
> http://www.gridrepublic.org
> "
>
>
