The block size and file roll size values depend on a few items here:

- The rate at which the data is being written.
- How frequently your processing layer is expected to run over
these files (sync() can help here, though).
- How you'll be processing them (MR, etc.).

Too many small files aren't only a problem for the NameNode (far from
it, in most cases); they're also an issue for processing - you end up
wasting cycles opening and closing files instead of doing long
contiguous block reads, which is what HDFS (directly or indirectly)
excels at when combined with processing.
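
To make the rollover idea concrete, here's a minimal sketch (not from
this thread) of a writer that rolls a SequenceFile once it reaches
roughly one HDFS block. The ROLL_SIZE constant, the output path scheme,
and the key/value types are all illustrative assumptions - adjust them
to your block size and data.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RollingSequenceWriter {
      // Assumed target: roughly one 128 MB block per rolled file.
      private static final long ROLL_SIZE = 128L * 1024 * 1024;

      private final FileSystem fs;
      private final Configuration conf;
      private SequenceFile.Writer writer;
      private int fileIndex = 0;

      public RollingSequenceWriter(Configuration conf) throws IOException {
        this.conf = conf;
        this.fs = FileSystem.get(conf);
        this.writer = openNext();
      }

      private SequenceFile.Writer openNext() throws IOException {
        // Hypothetical naming scheme for rolled files.
        Path path = new Path("/data/events/part-" + (fileIndex++));
        return SequenceFile.createWriter(fs, conf, path,
            LongWritable.class, Text.class);
      }

      public void append(long ts, String record) throws IOException {
        writer.append(new LongWritable(ts), new Text(record));
        // getLength() reports bytes written so far; roll once we
        // reach about one block's worth of data.
        if (writer.getLength() >= ROLL_SIZE) {
          writer.close();
          writer = openNext();
        }
      }

      public void close() throws IOException {
        writer.close();
      }
    }

With files rolled near the block size, a downstream MR job gets one
full block per file and spends its time on contiguous reads rather
than on file open/close overhead.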

On Wed, Jun 6, 2012 at 7:30 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> We have a continuous flow of data into the sequence file. I am wondering what
> would be the ideal file size before the file gets rolled over. I know too many
> small files are not good, but could someone tell me what would be the ideal
> size such that it doesn't overload the NameNode.



-- 
Harsh J
