The way I typically address that is to write the small files into a zip file using the zip utilities, commonly for output. HDFS is not optimized for low latency, but for high throughput on bulk operations.
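A minimal sketch of that idea, assuming the plain FileSystem API and java.util.zip (the local directory and the HDFS path below are made-up placeholders, and error handling is left out):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ZipToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One zip in HDFS instead of thousands of tiny files.
    FSDataOutputStream out = fs.create(new Path("/data/bundle.zip"));
    ZipOutputStream zip = new ZipOutputStream(out);

    for (File f : new File("/local/small-files").listFiles()) {
      zip.putNextEntry(new ZipEntry(f.getName()));
      InputStream in = new FileInputStream(f);
      IOUtils.copyBytes(in, zip, 4096, false); // don't close the zip stream yet
      in.close();
      zip.closeEntry();
    }
    zip.close(); // also closes the underlying HDFS stream
  }
}

That way the namenode tracks one file instead of thousands, and the upload itself is a single long sequential write, which is the access pattern HDFS is good at.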
2009/5/7 Edward Capriolo <edlinuxg...@gmail.com>:
> 2009/5/7 Jeff Hammerbacher <ham...@cloudera.com>:
> > Hey,
> >
> > You can read more about why small files are difficult for HDFS at
> > http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
> >
> > Regards,
> > Jeff
> >
> > 2009/5/7 Piotr Praczyk <piotr.prac...@gmail.com>
> >
> >> If you want to use many small files, they probably have the same
> >> purpose and structure?
> >> Why not use HBase instead of raw HDFS? Many small files would be
> >> packed together and the problem would disappear.
> >>
> >> cheers
> >> Piotr
> >>
> >> 2009/5/7 Jonathan Cao <jonath...@rockyou.com>
> >>
> >> > There are at least two design choices in Hadoop that have
> >> > implications for your scenario.
> >> > 1. All the HDFS metadata is stored in name node memory -- the
> >> > memory size is one limitation on how many "small" files you can
> >> > have.
> >> >
> >> > 2. The efficiency of the map/reduce paradigm dictates that each
> >> > mapper/reducer job has enough work to offset the overhead of
> >> > spawning the job. It relies on each task reading a contiguous
> >> > chunk of data (typically 64MB); your small-file situation will
> >> > turn those efficient sequential reads into a larger number of
> >> > inefficient random reads.
> >> >
> >> > Of course, small is a relative term.
> >> >
> >> > Jonathan
> >> >
> >> > 2009/5/6 陈桂芬 <chenguifen...@163.com>
> >> >
> >> > > Hi:
> >> > >
> >> > > In my application, there are many small files. But Hadoop is
> >> > > designed to deal with many large files.
> >> > >
> >> > > I want to know why Hadoop doesn't support small files very well
> >> > > and where the bottleneck is. And what can I do to improve
> >> > > Hadoop's capability of dealing with small files?
> >> > >
> >> > > Thanks.
> >> > >
> >> >
> >>
>
> When the small file problem comes up, most of the talk centers around
> the inode table being in memory. The Cloudera blog points out
> something:
>
> Furthermore, HDFS is not geared up to efficiently accessing small
> files: it is primarily designed for streaming access of large files.
> Reading through small files normally causes lots of seeks and lots of
> hopping from datanode to datanode to retrieve each small file, all of
> which is an inefficient data access pattern.
>
> My application attempted to load 9000 6KB files using a
> single-threaded application and FSOutputStream objects to write
> directly to Hadoop files. My plan was to have Hadoop merge these files
> in the next step. I had to abandon this plan because the process was
> taking hours. I knew HDFS had a "small file problem", but I never
> realized that I could not approach this problem the 'old-fashioned
> way'. I merged the files locally, and uploading the few merged files
> gave great throughput. Small files are not just a permanent-storage
> issue; they are a serious optimization issue.

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com, a community for Hadoop Professionals
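P.S. For anyone reading the quoted 9000-file experiment above, the file-at-a-time write pattern it describes looks roughly like this (a hypothetical reconstruction, not the original code; paths are made up). Every file means another create() call against the namenode and another write pipeline, which is presumably where the hours went:

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SlowSmallFileLoad {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // One HDFS file per tiny local file: one namenode create() and one
    // write pipeline for every few kilobytes of data.
    for (File f : new File("/local/small-files").listFiles()) {
      FSDataOutputStream out = fs.create(new Path("/data/in/" + f.getName()));
      IOUtils.copyBytes(new FileInputStream(f), out, 4096, true); // closes both streams
    }
  }
}

Merging locally first (or zipping, as above) replaces all of those per-file round trips with one big sequential write.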