There are at least two design choices in Hadoop that have implications for your scenario.

1. All of the HDFS metadata is stored in namenode memory -- the memory size is one limitation on how many "small" files you can have.

2. The efficiency of the map/reduce paradigm depends on each mapper/reducer task having enough work to offset the overhead of spawning the task. It relies on each task reading a contiguous chunk of data (typically 64 MB); with many small files, those efficient sequential reads turn into a larger number of inefficient random reads. A common mitigation is to pack many small files into fewer large ones, as sketched below.

Of course, "small" is a relative term.
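If it helps, here is a minimal sketch of that packing idea: copy every file in one HDFS directory into a single SequenceFile, keyed by file name, so the namenode tracks one large file and map tasks get larger sequential reads. The paths and the class name are made up for illustration, and I haven't tested this against any particular Hadoop release, so treat it as a starting point rather than a recipe.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path input = new Path("/user/demo/small-files");  // hypothetical input dir
    Path output = new Path("/user/demo/packed.seq");  // hypothetical output file

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, output, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(input)) {
        if (status.isDir()) {
          continue;  // skip subdirectories in this simple sketch
        }
        byte[] contents = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          in.readFully(0, contents);  // small file, so read it whole
        } finally {
          in.close();
        }
        // Key = original file name, value = raw file bytes.
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}

A later job can then read the packed file with SequenceFileInputFormat and get big splits instead of one tiny input per file. Hadoop Archives (har) or a combining input format are other options along the same lines, if I remember correctly.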
Jonathan

2009/5/6 陈桂芬 <chenguifen...@163.com>

> Hi:
>
> In my application, there are many small files, but Hadoop is designed to
> deal with many large files.
>
> I want to know why Hadoop doesn't support small files very well and where
> the bottleneck is, and what I can do to improve Hadoop's capability of
> dealing with small files.
>
> Thanks.