This is unexpected unless some other process is eating up space.

A couple of things to collect next time (along with the log):

- All the contents under datanode-directory/ (especially 'tmp' and 'current'). Does 'du' of this directory match what this DataNode reports to the NameNode (shown on the web UI)?
- Is there anything else taking disk space on the machine?
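The on-disk check above can be scripted. Here is a minimal sketch that sums usage per subdirectory (the 'current'/'tmp' layout under the datanode storage directory is taken from the discussion; the function names are mine, and the totals are meant to be compared by hand against what the DataNode reports on the NameNode web UI):

```python
import os

def dir_usage_bytes(path):
    """Sum the sizes of all regular files under `path`, like `du -sb`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):  # skip broken symlinks etc.
                total += os.path.getsize(fp)
    return total

def datanode_breakdown(data_dir):
    """Per-subdirectory usage (e.g. 'current' vs 'tmp') under a
    datanode storage directory, as a dict of name -> bytes."""
    report = {}
    for entry in sorted(os.listdir(data_dir)):
        sub = os.path.join(data_dir, entry)
        if os.path.isdir(sub):
            report[entry] = dir_usage_bytes(sub)
    return report
```

If 'tmp' (or block files under 'current') keeps growing while the web UI shows normal usage, that would support the theory that deleted blocks are not being cleaned up.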

Raghu.

Igor Bolotin wrote:
Normally I dislike writing about problems without being able to provide
some more information, but unfortunately in this case I just can't find
anything.

Here is the situation - a DFS cluster running Hadoop version 0.19.0. The
cluster runs on multiple servers with practically identical hardware.
Everything works perfectly well, except for one thing - from time to
time one of the data nodes (every time it's a different node) starts to
consume more and more disk space. The node keeps going, and if we don't
do anything it runs out of space completely (ignoring the 20GB reserved
space setting). Once restarted, it rapidly cleans the disk and goes back
to approximately the same utilization as the rest of the data nodes in
the cluster.

Scanning the datanode and namenode logs and comparing thread dumps
(stacks) from nodes experiencing the problem with those running normally
didn't produce any clues. Running the balancer tool didn't help at all.
FSCK shows that everything is healthy and the number of over-replicated
blocks is not significant.

To me it just looks like at some point the data node stops cleaning up
invalidated/deleted blocks but keeps reporting the space consumed by
these blocks as "not used". I'm not familiar enough with the internals,
though, and just plain don't have enough free time to start digging
deeper.

Does anyone have an idea what is wrong, what else we can do to find out
what's wrong, or maybe where to start looking in the code?

Thanks,

Igor


