The way I typically address that is to write the small files into a zip file using the zip utilities, commonly for output. HDFS is not optimized for low latency, but for high throughput on bulk operations.
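A minimal sketch of that idea, assuming the plain FileSystem API and java.util.zip (the local directory and the HDFS path below are made-up placeholders, and error handling is left out):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ZipToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One zip in HDFS instead of thousands of tiny files.
    FSDataOutputStream out = fs.create(new Path("/data/bundle.zip"));
    ZipOutputStream zip = new ZipOutputStream(out);

    for (File f : new File("/local/small-files").listFiles()) {
      zip.putNextEntry(new ZipEntry(f.getName()));
      InputStream in = new FileInputStream(f);
      IOUtils.copyBytes(in, zip, 4096, false); // don't close the zip stream yet
      in.close();
      zip.closeEntry();
    }
    zip.close(); // also closes the underlying HDFS stream
  }
}

That way the namenode tracks one file instead of thousands, and the upload itself is a single long sequential write, which is the access pattern HDFS is good at.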
2009/5/7 Edward Capriolo <edlinuxg...@gmail.com>:
> 2009/5/7 Jeff Hammerbacher <ham...@cloudera.com>:
> > Hey,
> >
> > You can read more about why small files are difficult for HDFS at
> > http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
> >
> > Regards,
> > Jeff
> >
> > 2009/5/7 Piotr Praczyk <piotr.prac...@gmail.com>
> >
> >> If you want to use many small files, they probably have the same
> >> purpose and structure?
> >> Why not use HBase instead of raw HDFS? Many small files would be
> >> packed together and the problem would disappear.
> >>
> >> cheers
> >> Piotr
> >>
> >> 2009/5/7 Jonathan Cao <jonath...@rockyou.com>
> >>
> >> > There are at least two design choices in Hadoop that have
> >> > implications for your scenario.
> >> > 1. All the HDFS metadata is stored in name node memory -- the
> >> > memory size is one limitation on how many "small" files you can
> >> > have.
> >> >
> >> > 2. The efficiency of the map/reduce paradigm dictates that each
> >> > mapper/reducer job has enough work to offset the overhead of
> >> > spawning the job. It relies on each task reading a contiguous
> >> > chunk of data (typically 64MB); your small-file situation will
> >> > turn those efficient sequential reads into a larger number of
> >> > inefficient random reads.
> >> >
> >> > Of course, small is a relative term.
> >> >
> >> > Jonathan
> >> >
> >> > 2009/5/6 陈桂芬 <chenguifen...@163.com>
> >> >
> >> > > Hi:
> >> > >
> >> > > In my application, there are many small files. But Hadoop is
> >> > > designed to deal with many large files.
> >> > >
> >> > > I want to know why Hadoop doesn't support small files very well
> >> > > and where the bottleneck is. And what can I do to improve
> >> > > Hadoop's capability of dealing with small files?
> >> > >
> >> > > Thanks.
> >> > >
> >> >
> >>
>
> When the small file problem comes up, most of the talk centers around
> the inode table being in memory. The Cloudera blog points out
> something:
>
> Furthermore, HDFS is not geared up to efficiently accessing small
> files: it is primarily designed for streaming access of large files.
> Reading through small files normally causes lots of seeks and lots of
> hopping from datanode to datanode to retrieve each small file, all of
> which is an inefficient data access pattern.
>
> My application attempted to load 9000 6KB files using a
> single-threaded application and FSOutputStream objects to write
> directly to Hadoop files. My plan was to have Hadoop merge these files
> in the next step. I had to abandon this plan because the process was
> taking hours. I knew HDFS had a "small file problem", but I never
> realized that I could not approach this problem the 'old-fashioned
> way'. I merged the files locally, and uploading the few merged files
> gave great throughput. Small files are not just a permanent-storage
> issue; they are a serious optimization issue.

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com, a community for Hadoop Professionals
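P.S. For anyone reading the quoted 9000-file experiment above, the file-at-a-time write pattern it describes looks roughly like this (a hypothetical reconstruction, not the original code; paths are made up). Every file means another create() call against the namenode and another write pipeline, which is presumably where the hours went:

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SlowSmallFileLoad {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // One HDFS file per tiny local file: one namenode create() and one
    // write pipeline for every few kilobytes of data.
    for (File f : new File("/local/small-files").listFiles()) {
      FSDataOutputStream out = fs.create(new Path("/data/in/" + f.getName()));
      IOUtils.copyBytes(new FileInputStream(f), out, 4096, true); // closes both streams
    }
  }
}

Merging locally first (or zipping, as above) replaces all of those per-file round trips with one big sequential write.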