That's what I saw just yesterday on one of the data nodes in this
situation (I'll confirm again next time it happens - see the script
below):
- Tmp and current were either empty or almost empty last time I checked.
- du on the entire data directory matched exactly the used space
reported in the NameNode web UI, which did show the node using most of
the available disk space.
- Nothing else was using disk space (it's a dedicated DFS cluster).
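
Next time it happens I plan to run something like the script below to
capture the numbers instead of eyeballing them. It's only a sketch: the
data directory path is a placeholder for our dfs.data.dir value, and it
assumes 'hadoop dfsadmin -report' prints "Name: host:port" and "DFS
Used: N" lines for each node, so it may need adjusting for your version:

#!/usr/bin/env python
# Sketch: compare a local 'du' of the DFS data directory with the
# "DFS Used" figure this DataNode reports to the NameNode.
import os
import re
import socket
import subprocess

DATA_DIR = "/data/dfs/data"  # placeholder - set to your dfs.data.dir

def du_bytes(path):
    """Sum apparent file sizes under path, roughly like 'du -sb'."""
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # block files can vanish while we walk
    return total

def reported_dfs_used(addr):
    """Pull this node's "DFS Used" value out of 'hadoop dfsadmin -report'."""
    out = subprocess.check_output(["hadoop", "dfsadmin", "-report"])
    node = None
    for line in out.decode("utf-8", "replace").splitlines():
        m = re.match(r"\s*Name:\s*([^:\s]+)", line)
        if m:
            node = m.group(1)
        m = re.match(r"\s*DFS Used:\s*(\d+)", line)
        if m and node == addr:
            return int(m.group(1))
    return None

if __name__ == "__main__":
    # The report usually lists nodes by IP; adjust if yours shows hostnames.
    addr = socket.gethostbyname(socket.gethostname())
    print("du of %s: %d bytes" % (DATA_DIR, du_bytes(DATA_DIR)))
    print("DFS Used reported to NameNode: %s bytes" % reported_dfs_used(addr))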

Thank you for the help!
Igor

-----Original Message-----
From: Raghu Angadi [mailto:rang...@yahoo-inc.com] 
Sent: Thursday, March 05, 2009 11:05 AM
To: core-user@hadoop.apache.org
Subject: Re: DataNode stops cleaning disk?


This is unexpected unless some other process is eating up space.

A couple of things to collect next time (along with the log):

  - All the contents under datanode-directory/ (especially 'tmp' and
'current')
  - Does 'du' of this directory match what this DataNode reports to the
NameNode (shown on the web UI)? (See the sketch after this list.)
  - Is there anything else taking disk space on the machine?
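
Something like this could capture the first two in one shot (an
untested sketch; the storage directory path is just a placeholder for
your dfs.data.dir):

#!/usr/bin/env python
# Untested sketch: snapshot the DataNode storage directory so the
# 'tmp' vs 'current' breakdown can be compared with the usage the
# node reports to the NameNode.
import os
import sys

def dir_stats(path):
    """Return (file_count, total_bytes) for everything under path."""
    count, total = 0, 0
    for root, dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
                count += 1
            except OSError:
                pass  # a block may be deleted while we walk
    return count, total

if __name__ == "__main__":
    data_dir = sys.argv[1] if len(sys.argv) > 1 else "/data/dfs/data"
    for sub in ("tmp", "current"):
        count, total = dir_stats(os.path.join(data_dir, sub))
        print("%-8s %9d files %15d bytes" % (sub, count, total))
    count, total = dir_stats(data_dir)
    print("%-8s %9d files %15d bytes" % ("total", count, total))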

Raghu.

Igor Bolotin wrote:
> Normally I dislike writing about problems without being able to
> provide some more information, but unfortunately in this case I just
> can't find anything.
> 
> Here is the situation - a DFS cluster running Hadoop version 0.19.0.
> The cluster runs on multiple servers with practically identical
> hardware. Everything works perfectly well, except for one thing - from
> time to time one of the data nodes (every time it's a different node)
> starts to consume more and more disk space. The node keeps going, and
> if we don't do anything it runs out of space completely (ignoring the
> 20GB reserved space setting). Once restarted, it rapidly cleans up
> disk space and goes back to approximately the same utilization as the
> rest of the data nodes in the cluster.
> 
> Scanning datanode and namenode logs and comparing thread dumps
> (stacks) from nodes experiencing the problem and those running
> normally didn't produce any clues. Running the balancer tool didn't
> help at all. FSCK shows that everything is healthy and the number of
> over-replicated blocks is not significant.
> 
> To me it just looks like at some point the data node stops cleaning
> up invalidated/deleted blocks, but keeps reporting the space consumed
> by these blocks as "not used". However, I'm not familiar enough with
> the internals, and I just plain don't have enough free time to start
> digging deeper.
> 
> Does anyone have an idea what is wrong, what else we could do to find
> out, or maybe where to start looking in the code?
> 
> Thanks,
> 
> Igor
>
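
If the theory above is right and the node really is sitting on
invalidated blocks, one rough way to check it (a sketch I have not run;
note that fsck over a large namespace is expensive, and blocks created
or deleted while it runs will show up as noise): collect the block ids
the NameNode still references via 'hadoop fsck / -files -blocks', then
diff them against the blk_* files under current/ on the suspect node. A
large pile of unreferenced block files would back up the theory.

#!/usr/bin/env python
# Rough, untested sketch: find block files on this DataNode that the
# NameNode no longer references. Assumes block ids appear in fsck
# output as blk_<number> and that DATA_DIR matches dfs.data.dir here.
import os
import re
import subprocess

DATA_DIR = "/data/dfs/data"  # placeholder - set to your dfs.data.dir

def referenced_block_ids():
    """Block ids the NameNode still knows about, per fsck output."""
    out = subprocess.check_output(["hadoop", "fsck", "/", "-files", "-blocks"])
    return set(re.findall(r"blk_-?\d+", out.decode("utf-8", "replace")))

def local_block_files(data_dir):
    """Map blk_<id> -> path for block files under current/ (skips .meta)."""
    blocks = {}
    for root, dirs, files in os.walk(os.path.join(data_dir, "current")):
        for name in files:
            m = re.match(r"(blk_-?\d+)$", name)
            if m:
                blocks[m.group(1)] = os.path.join(root, name)
    return blocks

if __name__ == "__main__":
    referenced = referenced_block_ids()
    local = local_block_files(DATA_DIR)
    orphans = [p for blk, p in local.items() if blk not in referenced]
    held = sum(os.path.getsize(p) for p in orphans if os.path.exists(p))
    print("%d local block files, %d not referenced by the NameNode"
          % (len(local), len(orphans)))
    print("space held by unreferenced blocks: %d bytes" % held)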
