The block size and file roll size values depend on a few items here:

- The rate at which the data is getting written.
- The frequency of the processing layer that is expected to run over these files (sync() can help here, though).
- The way you'll be processing them (MR, etc.).
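As a rough illustration, here is a minimal sketch of rolling by size with periodic sync markers. It assumes the Hadoop 1.x-era SequenceFile.createWriter(fs, conf, path, keyClass, valClass) API; the class name RollingSequenceWriter, the 128 MB threshold, and the file naming are illustrative choices, not anything prescribed in this thread:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RollingSequenceWriter {
    // Illustrative target: roll near the HDFS block size so each file
    // maps onto good contiguous block reads downstream.
    private static final long TARGET_ROLL_BYTES = 128L * 1024 * 1024;
    private static final int SYNC_INTERVAL = 100; // records between sync markers

    private final FileSystem fs;
    private final Configuration conf;
    private final Path dir;
    private SequenceFile.Writer writer;
    private int fileIndex = 0;
    private int recordsSinceSync = 0;

    public RollingSequenceWriter(Configuration conf, Path dir) throws IOException {
        this.conf = conf;
        this.fs = FileSystem.get(conf);
        this.dir = dir;
        openNext();
    }

    private void openNext() throws IOException {
        Path path = new Path(dir, "events-" + (fileIndex++) + ".seq");
        writer = SequenceFile.createWriter(fs, conf, path,
                LongWritable.class, Text.class);
    }

    public void append(long timestamp, String record) throws IOException {
        writer.append(new LongWritable(timestamp), new Text(record));
        // Periodic sync markers let readers and InputFormat splits find
        // record boundaries without waiting for the file to roll.
        if (++recordsSinceSync >= SYNC_INTERVAL) {
            writer.sync();
            recordsSinceSync = 0;
        }
        // Roll once the file reaches the target size.
        if (writer.getLength() >= TARGET_ROLL_BYTES) {
            writer.close();
            openNext();
        }
    }

    public void close() throws IOException {
        writer.close();
    }
}

The sync interval and roll threshold would be tuned against the write rate and the processing frequency from the list above.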
Too many small files are not only a problem for the NameNode (far from it, in most cases); they are also an issue for processing: you end up wasting cycles opening and closing files instead of doing good contiguous block reads, which is what HDFS (directly or indirectly) excels at when combined with processing.

On Wed, Jun 6, 2012 at 7:30 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> We have a continuous flow of data into the sequence file. I am wondering
> what the ideal file size would be before the file gets rolled over. I know
> too many small files are not good, but could someone tell me what the
> ideal size would be such that it doesn't overload the NameNode.

--
Harsh J