Yes, that is the problem: two, or hundreds of, data streams come in very quickly.
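A minimal sketch of the kind of writer this points at, combining Jason's per-path question with Tom's SequenceFile suggestion from the thread below. It assumes the 0.20-era Java API; the class name, stream id, and /foo/bar path are illustrative only, not taken from the thread. Each incoming stream gets its own file, so no two clients ever hold the lease on the same path:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class StreamSink {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One file per writer: the HDFS lease is per-path, so concurrent
    // writers sharing a path is what triggers AlreadyBeingCreatedException.
    String streamId = (args.length > 0) ? args[0] : "stream-0"; // illustrative
    Path path = new Path("/foo/bar/" + streamId + ".seq");

    // Block compression packs many small records together, which is the
    // point of using SequenceFile instead of millions of little files.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, NullWritable.class, Text.class, CompressionType.BLOCK);
    try {
      // Each "little file" becomes one key-value pair; the key can be
      // NullWritable if only the contents matter.
      writer.append(NullWritable.get(), new Text("one small record"));
      writer.append(NullWritable.get(), new Text("another small record"));
    } finally {
      writer.close();
    }
  }
}

Rolling each writer over to a fresh path on a time or size boundary keeps any single file from growing without bound while still avoiding millions of tiny files.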
On Fri, May 8, 2009 at 8:42 PM, jason hadoop <jason.had...@gmail.com> wrote:

> Is it possible that two tasks are trying to write to the same file path?
>
> On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
>
> > Hi Tom (or anyone else),
> > Will SequenceFile allow me to avoid problems with concurrent writes to the
> > file? I still continue to get the following exceptions/errors in hdfs:
> >
> > org.apache.hadoop.ipc.RemoteException:
> > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> > failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528
> > on client 127.0.0.1 because current leaseholder is trying to recreate file.
> >
> > This only happens when two processes try to write at the same time. Ideally
> > I don't want to buffer the data that's coming in; I want to get it out and
> > into the file as soon as possible to avoid any data loss. Am I missing
> > something here? Is there some sort of factory I can implement to help in
> > writing a lot of simultaneous data streams?
> >
> > Thanks in advance for any suggestions
> > -sasha
> >
> > On Wed, May 6, 2009 at 9:40 AM, Tom White <t...@cloudera.com> wrote:
> >
> > > Hi Sasha,
> > >
> > > As you say, HDFS appends are not yet working reliably enough to be
> > > suitable for production use. On the other hand, having lots of little
> > > files is bad for the namenode, and inefficient for MapReduce (see
> > > http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so
> > > it's best to avoid this too.
> > >
> > > I would recommend using SequenceFile as a storage container for lots
> > > of small pieces of data. Each key-value pair would represent one of
> > > your little files (you can have a null key, if you only need to store
> > > the contents of the file). You can also enable compression (use block
> > > compression), and SequenceFiles are designed to work well with
> > > MapReduce.
> > >
> > > Cheers,
> > >
> > > Tom
> > >
> > > On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy <sasha.do...@gmail.com> wrote:
> > > > hi there,
> > > > I'm working through a concept at the moment and was attempting to write
> > > > lots of data to a few files, as opposed to writing lots of data to lots
> > > > of little files. What are the thoughts on this?
> > > >
> > > > When I try to implement outputStream = hdfs.append(path); there doesn't
> > > > seem to be any locking mechanism in place ... or there is, and it doesn't
> > > > work well enough for many writes per second?
> > > >
> > > > I have read and seen that the property "dfs.support.append" is not meant
> > > > for production use. Still, if millions of little files are as good as or
> > > > better than -- or no different from -- a few massive files, then I
> > > > suppose append isn't something I really need.
> > > >
> > > > I do see a lot of stack traces with messages like:
> > > >
> > > > org.apache.hadoop.ipc.RemoteException:
> > > > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
> > > > create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
> > > > client 127.0.0.1 because current leaseholder is trying to recreate file.
> > > >
> > > > I hope this makes sense. Still a little bit confused.
> > > >
> > > > Thanks in advance
> > > > -sd
> > > >
> > > > --
> > > > Sasha Dolgy
> > > > sasha.do...@gmail.com
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals

--
Sasha Dolgy
sasha.do...@gmail.com
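For completeness, the append call Sasha refers to would look roughly like this. It is a sketch only: it assumes dfs.support.append is enabled (which the thread notes is not production-ready in this era of Hadoop), and the class name and path are illustrative:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    Path path = new Path("/foo/bar/events.seq"); // illustrative path

    // HDFS grants a single-writer lease per file: a second process calling
    // append() (or create()) on the same path gets the
    // AlreadyBeingCreatedException shown in the stack traces above.
    FSDataOutputStream out = hdfs.append(path);
    try {
      out.write("one more record\n".getBytes("UTF-8"));
      out.sync(); // flush to datanodes (the API of this era; later hflush/hsync)
    } finally {
      out.close();
    }
  }
}

Even with append working, only one writer can hold the lease on a path at a time, so two processes appending to the same file will still collide; the per-stream files sketched earlier sidestep that.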