It is definitely better to combine small files into larger ones, if only to
make sure that you get sequential reads as much as possible.
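
One common way to do that is to pack the small files into a single
SequenceFile keyed by the original file name. A minimal sketch (assuming
each small file fits in memory; the paths and the Text/BytesWritable
key/value choice are just illustrative):

import java.io.DataInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);  // directory of small files
        Path packed = new Path(args[1]);    // single combined output file

        // One big SequenceFile: key = original name, value = raw bytes.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, packed, Text.class, BytesWritable.class);
        try {
            for (FileStatus stat : fs.listStatus(inputDir)) {
                if (stat.isDir()) {
                    continue;  // skip subdirectories in this sketch
                }
                byte[] contents = new byte[(int) stat.getLen()];
                DataInputStream in = fs.open(stat.getPath());
                try {
                    in.readFully(contents);
                } finally {
                    in.close();
                }
                writer.append(new Text(stat.getPath().getName()),
                              new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}

The result is one large file that HDFS stores as a handful of blocks
instead of thousands of tiny ones, and that a map task can scan
sequentially.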


On 2/21/08 9:48 PM, "Steve Sapovits" <[EMAIL PROTECTED]> wrote:

> Amar Kamat wrote:
> 
>> File sizes and number of files (assuming that's what you want to tweak)
>> are not much of a concern for map-reduce. What ultimately matters is the
>> dfs-block-size and split-size. The basic unit of replication in DFS is
>> the block, while the basic processing unit for map-reduce is the split.
>> Other parameters don't matter much if you control the block size
>> (dfs.block.size) and the split size (mapred.min.split.size).
> 
> What about the write side? Someone indicated to me that HDFS wasn't
> very good at storing lots of small files -- that it would be better to
> combine things into larger files somehow.
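
To make the two knobs Amar mentions concrete, they can be set in the job
configuration; a rough sketch (the 128 MB figure is only an example, and
dfs.block.size applies to files the job writes -- the cluster-wide default
lives in your site configuration):

Configuration conf = new Configuration();
// Blocks for files this job writes: 128 MB instead of the 64 MB default.
conf.setLong("dfs.block.size", 128L * 1024 * 1024);
// Don't let map-reduce carve splits smaller than one block.
conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);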
