On May 21, 2009, at 3:10 PM, Stas Oskin wrote:
Hi.
If this analysis is right, I would add it can happen even on large
clusters!
I've seen this error at our cluster when we're very full (>97%) and
very few nodes have any empty space. This usually happens because we
have two very large nodes (10x bigger than the rest of the cluster),
and HDFS tends to distribute writes randomly -- meaning the smaller
nodes fill up quickly, until the balancer can catch up.
A bit off topic: do you run the balancer manually, or do you have some
scheduler that does it?
crontab does it for us, once an hour. We're always importing data, so
the cluster is always out-of-balance.
If the previous balancer didn't exit, the new one will simply exit.
The real trick has been to make sure the balancer doesn't get stuck --
a Nagios plugin checks that the balancer has written to stdout within
the last hour or so; otherwise it kills the running balancer. Stuck
balancers have been an issue in the past.
Brian