It is definitely better to combine files into larger ones, if only to ensure that reads are as sequential as possible.
On 2/21/08 9:48 PM, "Steve Sapovits" <[EMAIL PROTECTED]> wrote:

> Amar Kamat wrote:
>
>> File sizes and number of files (assuming that's what you want to tweak)
>> are not much of a concern for map-reduce. What ultimately matters is the
>> dfs-block-size and split-size. The basic unit of replication in DFS is
>> the block, while the basic processing unit for map-reduce is the split.
>> Other parameters don't matter much if you control the block size
>> (dfs.block.size) and the split size (mapred.min.split.size).
>
> What about the write side? Someone indicated to me that HDFS wasn't
> real good about storing lots of small files -- that it would be better to
> somehow combine things into larger files.
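For reference, the two parameters mentioned above can be set in the cluster's configuration file (hadoop-site.xml in the Hadoop releases of this era, or passed per-job with -D). A sketch, with purely illustrative values -- the 128 MB block size and split size shown here are assumptions, not a recommendation from this thread:

```xml
<configuration>
  <!-- DFS block size in bytes: the unit of replication.
       128 MB here; the shipped default was 64 MB. -->
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <!-- Lower bound on split size in bytes: the unit of
       map-reduce processing. Raising it yields fewer,
       larger map tasks. -->
  <property>
    <name>mapred.min.split.size</name>
    <value>134217728</value>
  </property>
</configuration>
```

Since a split never spans files, many small files still mean many small splits (and many map tasks) regardless of these settings -- which is why combining small files into larger ones, as discussed above, matters on the write side.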