Hi,

While researching the 'small files' issue we are having, I couldn't find
anything that explains what I am seeing after a change.

Before: all files were stored in a structure like /source/year/month/day/
where we had dozens of files in each day's directory (and 500+ sources). We
were using a lot more memory than we expected in the NameNode, so we
redesigned the directory structure. Here is the 'before' summary:


*1823121 files and directories, 1754612 blocks = 3577733 total. Heap Size is
1.94 GB / 1.94 GB (100%)*


The heap size relative to the number of files was higher than we expected
(using the 150 bytes/file rule of thumb from Cloudera), so we redesigned our
approach.
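
As a rough sanity check of that expectation, the rule of thumb can be applied
to the 'before' counts. This is only a sketch: it assumes about 150 bytes per
namespace object (file, directory, or block), and the function name is made up
for illustration. Under that assumption the estimate is roughly 0.5 GB, well
below the 1.94 GB reported.

```python
# Assumption: ~150 bytes of NameNode heap per namespace object
# (file, directory, or block). A rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def estimate_heap_bytes(files_and_dirs: int, blocks: int) -> int:
    """Estimate NameNode heap from object counts (hypothetical helper)."""
    return (files_and_dirs + blocks) * BYTES_PER_OBJECT


# Counts taken from the 'before' summary line.
before_bytes = estimate_heap_bytes(1_823_121, 1_754_612)
print(f"estimated 'before' heap: {before_bytes / 2**30:.2f} GiB")
```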



After: we simplified the layout to /source/year_month/. Although each
directory now holds a lot of files, memory usage dropped significantly:


*1824616 files and directories, 1754612 blocks = 3579228 total. Heap Size is
1.18 GB / 1.74 GB (67%)*


This was a surprise, since we hadn't done the file compaction step yet and
already saw a huge drop in memory usage.
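
To put the two summaries side by side, the change in object count can be
compared with the change in reported heap. This is a small sketch using only
the figures quoted above; the variable names are illustrative. The object
count rises slightly (the new directories) while the used heap falls by more
than a third.

```python
# Figures copied from the 'before' and 'after' summary lines.
before_objects = 1_823_121 + 1_754_612   # files/dirs + blocks
after_objects = 1_824_616 + 1_754_612
before_heap_gb = 1.94                    # used heap as reported
after_heap_gb = 1.18

object_change = after_objects - before_objects
object_change_pct = object_change / before_objects * 100
heap_change_pct = (after_heap_gb - before_heap_gb) / before_heap_gb * 100

print(f"objects: {object_change:+d} ({object_change_pct:+.2f}%)")
print(f"used heap: {heap_change_pct:+.1f}%")
```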



What I don't understand is why the memory usage changed. The old structure
is still there (/source/year/month/day) but with no files in the leaf
directories. The reorg process only moved the files to the new structure; a
separate step is going to remove the empty directories. The 'before' summary
was taken after the cluster had been idle for 4+ hours, so I don't think it
was a GC timing issue.



I'm looking to understand what happened so I can make sure our capacity
calculations based on # of files and # of directories are correct. We're
using: 0.20.2, r911707



Thanks,



Chris
