On May 21, 2009, at 3:10 PM, Stas Oskin wrote:

Hi.

If this analysis is right, I would add it can happen even on large clusters!
I've seen this error on our cluster when we're very full (>97%) and very few nodes have any empty space. This usually happens because we have two very large nodes (10x bigger than the rest of the cluster), and HDFS tends to distribute writes randomly -- meaning the smaller nodes fill up quickly, until the balancer can catch up.
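
To put a rough number on that effect, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simplified model that is not from the thread: every node receives the same expected share of writes, and replication, rack awareness, and the balancer are ignored. The function name and the 18-small-node cluster size are hypothetical; only the "two large nodes, 10x bigger" part comes from the description above.

```python
def utilization_when_small_nodes_fill(num_small, num_large, size_ratio):
    """Fraction of total cluster capacity in use at the moment the small
    nodes become full, assuming every node receives an equal share of writes.

    Capacities are measured in units of one small node. This is a simplified
    model (assumption): it ignores replication and the balancer.
    """
    nodes = num_small + num_large
    total_capacity = num_small + num_large * size_ratio
    # With equal writes per node, each node holds 1 unit when the small
    # nodes hit their limit, so the cluster holds `nodes` units in total.
    written = float(nodes)
    return written / total_capacity


# Hypothetical cluster: 18 small nodes plus 2 nodes that are 10x larger.
print(utilization_when_small_nodes_fill(18, 2, 10))
```

Under these assumptions the small nodes are full when the cluster as a whole is only a little over half used, which is why the balancer has to keep moving blocks onto the large nodes.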



A bit off topic: do you run the balancer manually, or do you have some scheduler
that does it?

crontab does it for us, once an hour. We're always importing data, so the cluster is always out-of-balance.

If the previous balancer is still running, the new one simply exits.

The real trick has been to make sure the balancer doesn't get stuck -- a Nagios plugin checks that the balancer has printed something to stdout in the last hour or so, and otherwise kills the running balancer. Stuck balancers have been an issue in the past.
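
A minimal sketch of that kind of watchdog check, in Python. This is not the actual plugin; it assumes the balancer's stdout is redirected to a log file whose modification time shows when output was last written. The names `check_balancer`, `is_stale`, `log_path`, and `pid` are hypothetical. The return values follow the standard Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import os
import signal
import time

# "The last hour or so": treat the balancer as stuck after this much silence.
STALE_AFTER_SECONDS = 60 * 60


def is_stale(last_output_time, now, limit=STALE_AFTER_SECONDS):
    """True if the last output is older than the allowed silence window."""
    return (now - last_output_time) > limit


def check_balancer(log_path, pid, now=None, kill=os.kill):
    """Kill the balancer if its log has not been written to recently.

    log_path: file receiving the balancer's stdout (assumption).
    pid: process ID of the running balancer (assumption: known to the caller).
    Returns a Nagios-style status code: 0 (OK) or 2 (CRITICAL).
    """
    now = time.time() if now is None else now
    last_output = os.path.getmtime(log_path)
    if is_stale(last_output, now):
        # Ask the stuck balancer to terminate; the next cron run starts a new one.
        kill(pid, signal.SIGTERM)
        return 2
    return 0
```

The `now` and `kill` parameters exist only so the check can be exercised without a real process; a plugin wrapper would pass the return value to `sys.exit()`.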

Brian
