Thank you.
Is there any way I can measure the startup overhead in terms of time?
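
For instance, would timing a near-empty job over a single tiny input file
give a reasonable estimate? A rough sketch of what I have in mind, assuming
the old 0.20 mapred API (the paths and the class name are just placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class StartupProbe {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(StartupProbe.class);
        conf.setJobName("startup-probe");
        // One tiny input file and identity map/reduce, so nearly all of
        // the elapsed time is job submission and task startup.
        FileInputFormat.setInputPaths(conf, new Path("/tmp/one-small-file"));
        FileOutputFormat.setOutputPath(conf, new Path("/tmp/startup-probe-out"));
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        long start = System.currentTimeMillis();
        JobClient.runJob(conf); // blocks until the job completes
        System.out.println("Wall-clock time: "
            + (System.currentTimeMillis() - start) + " ms");
      }
    }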

On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles <patr...@cloudera.com> wrote:

> Pierre,
>
> Adding to what Brian has said (some things are not explicitly mentioned in
> the HDFS design doc)...
>
> - If you have small files that are smaller than 64MB, you do not actually
> use up the entire 64MB block on disk.
> - You *do* use up RAM on the NameNode, as each block represents metadata
> that needs to be maintained in memory on the NameNode.
> - Hadoop won't perform optimally with very small block sizes. Hadoop I/O is
> optimized for high sustained throughput per single file/block, and there is
> a penalty for doing too many seeks to get to the beginning of each block.
> Additionally, you will have a MapReduce task per small file, and each
> MapReduce task has a non-trivial startup overhead.
> - The recommendation is to consolidate your small files into large files.
> One way to do this is via SequenceFiles: put the filename in the
> SequenceFile key field and the file's bytes in the SequenceFile value
> field.
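>
> For example, a small standalone writer along these lines would pack a
> directory of small files into one SequenceFile (just a rough sketch; the
> local input directory and the HDFS output path are placeholders):
>
>     import java.io.DataInputStream;
>     import java.io.File;
>     import java.io.FileInputStream;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.BytesWritable;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Text;
>
>     public class PackSmallFiles {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(conf);
>         SequenceFile.Writer writer = SequenceFile.createWriter(
>             fs, conf, new Path("/user/pierre/packed.seq"),
>             Text.class, BytesWritable.class);
>         try {
>           // Assumes the local directory exists and holds only plain files.
>           for (File f : new File("/local/small-files").listFiles()) {
>             byte[] data = new byte[(int) f.length()];
>             DataInputStream in = new DataInputStream(new FileInputStream(f));
>             try {
>               in.readFully(data);
>             } finally {
>               in.close();
>             }
>             // Key = original file name, value = the file's raw bytes.
>             writer.append(new Text(f.getName()), new BytesWritable(data));
>           }
>         } finally {
>           writer.close();
>         }
>       }
>     }
>
> Downstream MapReduce jobs can then read the packed file with
> SequenceFileInputFormat, one record per original small file.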
>
> In addition to the HDFS design docs, I recommend reading this blog post:
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
> Happy Hadooping,
>
> - Patrick
>
> On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT <pierre...@gmail.com>
> wrote:
>
> > Okay, thank you :)
> >
> >
> > On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> >
> > >
> > > On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
> > >
> > > > Hi, thanks for this fast answer :)
> > > > If so, what do you mean by blocks? If a file has to be split, will it
> > > > only be split when it is larger than 64MB?
> > > >
> > >
> > > For every 64MB of the file, Hadoop will create a separate block. So, if
> > > you have a 32KB file, there will be one block of 32KB. If the file is
> > > 65MB, then it will have one block of 64MB and another block of 1MB.
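> > >
> > > For instance, you can ask the FileSystem API how a given file was
> > > actually split (a rough sketch; the path is just a placeholder):
> > >
> > >     import org.apache.hadoop.conf.Configuration;
> > >     import org.apache.hadoop.fs.BlockLocation;
> > >     import org.apache.hadoop.fs.FileStatus;
> > >     import org.apache.hadoop.fs.FileSystem;
> > >     import org.apache.hadoop.fs.Path;
> > >
> > >     public class ShowBlocks {
> > >       public static void main(String[] args) throws Exception {
> > >         FileSystem fs = FileSystem.get(new Configuration());
> > >         FileStatus st = fs.getFileStatus(new Path("/user/pierre/somefile"));
> > >         // A 65MB file should print two blocks: one of 64MB, one of 1MB.
> > >         for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
> > >           System.out.println("offset=" + b.getOffset()
> > >               + " length=" + b.getLength());
> > >         }
> > >       }
> > >     }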
> > >
> > > Splitting files is very useful for load-balancing and distributing I/O
> > > across multiple nodes. At 32KB / file, you don't really need to split
> > > the files at all.
> > >
> > > I recommend reading the HDFS design document for background issues like
> > > this:
> > >
> > > http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
> > >
> > > Brian
> > >
> > > >
> > > >
> > > >
> > > > On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> > > >
> > > >> Hey Pierre,
> > > >>
> > > >> These are not traditional filesystem blocks - if you save a file
> > > >> smaller than 64MB, you don't lose 64MB of file space.
> > > >>
> > > >> Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata
> > > >> or so), not 64MB.
> > > >>
> > > >> Brian
> > > >>
> > > >> On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
> > > >>
> > > >>> Hi,
> > > >>> I'm porting a legacy application to Hadoop, and it uses a bunch of
> > > >>> small files.
> > > >>> I'm aware that having such small files isn't a good idea, but I'm not
> > > >>> making the technical decisions, and the port has to be done for
> > > >>> yesterday...
> > > >>> Of course such small files are a problem; loading 64MB blocks for a
> > > >>> few lines of text is an obvious waste.
> > > >>> What will happen if I set a smaller, or even much smaller (32KB),
> > > >>> block size?
> > > >>>
> > > >>> Thank you.
> > > >>>
> > > >>> Pierre ANCELOT.
> > > >>
> > > >>
> > > >
> > > >
> > > > --
> > > > http://www.neko-consulting.com
> > > > Ego sum quis ego servo
> > > > "Je suis ce que je protège"
> > > > "I am what I protect"
> > >
> > >
> >
> >
> > --
> > http://www.neko-consulting.com
> > Ego sum quis ego servo
> > "Je suis ce que je protège"
> > "I am what I protect"
> >
>



-- 
http://www.neko-consulting.com
Ego sum quis ego servo
"Je suis ce que je protège"
"I am what I protect"
