We have an 11-node Hadoop cluster running 0.20.2 that has been in production for 
15 months now.  The system is used to process log files that are ingested 
daily, and the oldest files in the HDFS are deleted to free up space as needed, 
typically when the free space is less than 10% (the delete is done using 
'hadoop fs -rmr' on the parent directory of the files to be deleted).  When the 
HDFS was originally built it had 1TB of 'Non DFS Used' space out of the 20TB 
total.  This 1TB stayed constant for at least the first year the system was in 
use.
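
For reference, the cleanup rule described above can be sketched roughly as 
follows.  This is only an illustration of the policy, not the actual script we 
run: the function names and the example directory paths are hypothetical, and 
the capacity and remaining figures are assumed to come from wherever you read 
them (e.g. the NameNode status page).

```python
TB = 1024 ** 4

def needs_cleanup(capacity_bytes, remaining_bytes, threshold=0.10):
    """Return True when the free fraction of HDFS is below the threshold."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    return remaining_bytes / capacity_bytes < threshold

def rmr_command(parent_dir):
    """Build the delete command for one parent directory (not executed here)."""
    return ["hadoop", "fs", "-rmr", parent_dir]

def plan_cleanup(capacity_bytes, remaining_bytes, dirs_oldest_first):
    """Return the command for the oldest directory, or None if no cleanup is due.

    dirs_oldest_first is a hypothetical list of HDFS parent directories,
    already sorted so that the oldest one comes first.
    """
    if not dirs_oldest_first:
        return None
    if not needs_cleanup(capacity_bytes, remaining_bytes):
        return None
    return rmr_command(dirs_oldest_first[0])

if __name__ == "__main__":
    # Example: 20TB total with 1TB free is 5% free, so a delete is planned.
    print(plan_cleanup(20 * TB, 1 * TB, ["/logs/day-001", "/logs/day-002"]))
```

The command is only built, not run, so the sketch can be tried without 
touching the cluster.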

However, over the last few weeks I have seen the 'Non DFS Used' as reported by 
the NameNode dfshealth.jsp page grow to 2TB and rising.  The total number of 
files/directories and blocks in use has remained fairly constant over this 
time.  I am concerned that the Non DFS Used is going to consume more and more 
of the HDFS if left unchecked.  Running fsck gave "The filesystem under path 
'/' is HEALTHY".

Questions:

1) What exactly is Hadoop reporting as 'Non DFS Used', and how is it 
calculated?  Are these files on the same partition(s) as the HDFS files, but 
are not actually part of the HDFS?

2) Any ideas on what is driving the growth in Non DFS Used space?   I looked 
for things like growing log files on the datanodes but didn't find anything.
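
One way to cross-check this on an individual datanode is sketched below.  It 
rests on an unconfirmed assumption, namely that 'Non DFS Used' roughly 
corresponds to the space used on the data partition minus the space taken by 
files under the DFS data directory.  The path /data/dfs in the example is 
hypothetical, and the helper names are made up for illustration.

```python
import os

def dir_size_bytes(path):
    """Sum the apparent sizes of all regular files under path.

    Note: this uses st_size (apparent size), which can differ from the
    number of bytes actually allocated on disk.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file may vanish while we walk; skip it.
                pass
    return total

def partition_used_bytes(path):
    """Bytes used on the filesystem that holds path (Unix only)."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize

def estimate_non_dfs_bytes(partition_used, dfs_data_used):
    """Assumed estimate: partition usage not explained by DFS data files."""
    return max(partition_used - dfs_data_used, 0)

if __name__ == "__main__":
    data_dir = "/data/dfs"  # hypothetical DFS data directory on a datanode
    if os.path.isdir(data_dir):
        used = partition_used_bytes(data_dir)
        dfs = dir_size_bytes(data_dir)
        print("estimated non-DFS bytes:", estimate_non_dfs_bytes(used, dfs))
```

If the estimate grows on some nodes but not others, that would at least 
narrow down where to look.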

Thanks,
Scott
