Is it possible that two tasks are trying to write to the same file path?

On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy <sdo...@gmail.com> wrote:

> Hi Tom (or anyone else),
> Will SequenceFile allow me to avoid problems with concurrent writes to the
> file?  I still continue to get the following exceptions/errors in HDFS:
>
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for
> DFSClient_-1821265528
> on client 127.0.0.1 because current leaseholder is trying to recreate file.
>
> Only happens when two processes are trying to write at the same time.  Now,
> ideally I don't want to buffer the data that's coming in; I want to get it
> out and into the file as soon as possible to avoid any data loss.  Am I
> missing something here?  Is there some sort of factory I can implement to
> help in writing a lot of simultaneous data streams?  (One possible approach
> is sketched after this message.)
>
> thanks in advance for any suggestions
> -sasha
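
A minimal sketch of one possible answer to the question about simultaneous
data streams above: funnel every incoming stream through a single writer
thread, so only one HDFS client ever holds the lease on the file. This is not
code from the thread; the class name, the queue, and the sync() call are
illustrative assumptions.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: any number of producer threads enqueue records;
// exactly one thread writes to HDFS, so only one client holds the lease.
public class SingleWriterFunnel implements Runnable {

  private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<byte[]>();
  private final Path path;

  public SingleWriterFunnel(Path path) {
    this.path = path;
  }

  // Called from any number of producer threads; never touches HDFS directly.
  public void enqueue(byte[] record) throws InterruptedException {
    queue.put(record);
  }

  // The single writer thread.
  public void run() {
    try {
      FileSystem fs = FileSystem.get(new Configuration());
      FSDataOutputStream out = fs.create(path);   // one open stream, one lease
      try {
        while (!Thread.currentThread().isInterrupted()) {
          byte[] record = queue.take();           // blocks until data arrives
          out.write(record);
          out.sync();   // flush to reduce data loss; durability guarantees
                        // were limited in HDFS at this point
        }
      } finally {
        out.close();
      }
    } catch (Exception e) {
      // a real implementation would log and retry; omitted in this sketch
    }
  }
}

Producer threads only ever call enqueue() and never open the HDFS file
themselves, which avoids two clients competing for the same lease on one path.
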
>
> On Wed, May 6, 2009 at 9:40 AM, Tom White <t...@cloudera.com> wrote:
>
> > Hi Sasha,
> >
> > As you say, HDFS appends are not yet working reliably enough to be
> > suitable for production use. On the other hand, having lots of little
> > files is bad for the namenode, and inefficient for MapReduce (see
> > http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so
> > it's best to avoid this too.
> >
> > I would recommend using SequenceFile as a storage container for lots
> > of small pieces of data. Each key-value pair would represent one of
> > your little files (you can have a null key, if you only need to store
> > the contents of the file). You can also enable compression (use block
> > compression), and SequenceFiles are designed to work well with
> > MapReduce.
> >
> > Cheers,
> >
> > Tom
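
A minimal sketch of the SequenceFile approach Tom describes, with a null
(NullWritable) key and block compression; the path and the Text value type
are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallRecordsToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/foo/bar/events.seq");   // illustrative path

    // Block compression packs many small records together before compressing.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, NullWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      // Each small "file" becomes one key-value pair; the key can be null.
      writer.append(NullWritable.get(), new Text("contents of one small file"));
      writer.append(NullWritable.get(), new Text("contents of another"));
    } finally {
      writer.close();
    }
  }
}
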
> >
> > On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy <sasha.do...@gmail.com>
> > wrote:
> > > Hi there,
> > > I'm working through a concept at the moment and was attempting to write
> > > lots of data to a few files as opposed to writing lots of data to lots
> > > of little files.  What are the thoughts on this?
> > >
> > > When I try to implement outputStream = hdfs.append(path); there
> > > doesn't seem to be any locking mechanism in place ... or there is and
> > > it doesn't work well enough for many writes per second?
> > >
> > > I have read and seen that the property "dfs.support.append" is not
> > > meant for production use.  Still, if millions of little files are as
> > > good as or better than -- or no different from -- a few massive files,
> > > then I suppose append isn't something I really need.
> > >
> > > I do see a lot of stack traces with messages like:
> > >
> > > org.apache.hadoop.ipc.RemoteException:
> > > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
> > > create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
> > > client 127.0.0.1 because current leaseholder is trying to recreate file.
> > >
> > > I hope this makes sense.  Still a little bit confused.
> > >
> > > thanks in advance
> > > -sd
> > >
> > > --
> > > Sasha Dolgy
> > > sasha.do...@gmail.com
> >
>
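
For completeness, a minimal sketch of the append path mentioned in the
original message; the property name and the single-leaseholder behaviour are
as discussed above, the path and class name are illustrative, and (as noted
in the thread) append was not considered production-ready at this point:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // dfs.support.append must also be enabled cluster-side (hdfs-site.xml).
    conf.setBoolean("dfs.support.append", true);

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/foo/bar/events.log");   // illustrative path

    // Only one client may hold the lease on a file at a time; a second
    // concurrent open/append on the same path is what produces
    // AlreadyBeingCreatedException.
    FSDataOutputStream out = fs.append(path);
    try {
      out.write("one more record\n".getBytes("UTF-8"));
    } finally {
      out.close();
    }
  }
}
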



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
