Hi,

During our research into the 'small files' issue we are having, I couldn't find anything that explains what I see after a change.

Before: all files were stored in a structure like /source/year/month/day/, with dozens of files in each day's directory (and 500+ sources). The NameNode was using a lot more memory than we expected, so we redesigned the directory structure. Here is the 'before' summary:

    1823121 files and directories, 1754612 blocks = 3577733 total. Heap Size is 1.94 GB / 1.94 GB (100%)

The heap size relative to the number of files was higher than we expected (using the 150 bytes/file rule of thumb from Cloudera).

After: we simplified the layout into /source/year_month/. Each directory now holds a lot of files, but the memory usage dropped significantly:

    1824616 files and directories, 1754612 blocks = 3579228 total. Heap Size is 1.18 GB / 1.74 GB (67%)

This was a surprise, since we hadn't yet done the file compaction step and already saw a huge drop in memory usage.

What I don't understand is why the memory usage changed. The old structure is still there (/source/year/month/day) but with no files in the leaf directories. The reorg process only moved the files to the new structure; a separate step is going to remove the empty directories. The 'before' figure was taken after the cluster had been idle for 4+ hours, so I don't think it was a GC timing issue.

I'm looking to understand what happened so I can make sure our capacity calculations, which are based on the number of files and directories, are correct.

We're using Hadoop 0.20.2, r911707.

Thanks,
Chris
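
For reference, here is a minimal sketch of the rule-of-thumb estimate applied to the two summaries above. It assumes the 150-byte figure applies to every namespace object (each file, directory, and block), which is one common reading of the rule and not something confirmed here. The function name and constant are hypothetical, chosen only for illustration.

```python
# Assumption: ~150 bytes of NameNode heap per namespace object
# (file, directory, or block). This is a rule of thumb, not a measurement.
BYTES_PER_OBJECT = 150


def estimate_heap_bytes(files_and_dirs, blocks, bytes_per_object=BYTES_PER_OBJECT):
    """Return the rule-of-thumb heap estimate in bytes for a namespace."""
    return (files_and_dirs + blocks) * bytes_per_object


# Object counts taken from the 'before' and 'after' summaries in this post.
before = estimate_heap_bytes(1823121, 1754612)
after = estimate_heap_bytes(1824616, 1754612)

for label, estimate in (("before", before), ("after", after)):
    print(f"{label}: {estimate} bytes ({estimate / 2**30:.2f} GiB)")
```

Under that assumption both estimates come out at roughly 0.5 GiB and differ by only about 224 KB, so the object counts alone would not predict the drop in reported heap usage.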