Hi Tom (or anyone else),

Will SequenceFile allow me to avoid problems with concurrent writes to the
file? I still continue to get the following exceptions/errors in HDFS:
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
client 127.0.0.1 because current leaseholder is trying to recreate file.

This only happens when two processes are trying to write at the same time.
Now, ideally I don't want to buffer the data that's coming in; I want to
get it out and into the file ASAP to avoid any data loss. Am I missing
something here? Is there some sort of factory I can implement to help in
writing a lot of simultaneous data streams?

Thanks in advance for any suggestions,
-sasha

On Wed, May 6, 2009 at 9:40 AM, Tom White <t...@cloudera.com> wrote:
> Hi Sasha,
>
> As you say, HDFS appends are not yet working reliably enough to be
> suitable for production use. On the other hand, having lots of little
> files is bad for the namenode, and inefficient for MapReduce (see
> http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so
> it's best to avoid this too.
>
> I would recommend using SequenceFile as a storage container for lots
> of small pieces of data. Each key-value pair would represent one of
> your little files (you can have a null key, if you only need to store
> the contents of the file). You can also enable compression (use block
> compression), and SequenceFiles are designed to work well with
> MapReduce.
>
> Cheers,
>
> Tom
>
> On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy <sasha.do...@gmail.com> wrote:
> > Hi there,
> >
> > I'm working through a concept at the moment and was attempting to write
> > lots of data to a few files, as opposed to writing lots of data to lots
> > of little files. What are the thoughts on this?
> >
> > When I try to implement outputStream = hdfs.append(path); there doesn't
> > seem to be any locking mechanism in place ... or there is and it doesn't
> > work well enough for many writes per second?
> >
> > I have read and seen that the property "dfs.support.append" is not meant
> > for production use. Still, if millions of little files are as good as,
> > better than, or no different from a few massive files, then I suppose
> > append isn't something I really need.
> >
> > I do see a lot of stack traces with messages like:
> >
> > org.apache.hadoop.ipc.RemoteException:
> > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
> > create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
> > client 127.0.0.1 because current leaseholder is trying to recreate file.
> >
> > I hope this makes sense. Still a little bit confused.
> >
> > Thanks in advance,
> > -sd
> >
> > --
> > Sasha Dolgy
> > sasha.do...@gmail.com
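
For what it's worth, here is a minimal, untested sketch of the pattern Tom's
suggestion implies: many producer threads hand records to an in-memory queue,
and exactly one thread owns the SequenceFile.Writer, since HDFS only allows a
single lease holder (writer) per file at a time. The class name, output path,
and the choice of a NullWritable key with a Text value are assumptions made
for illustration only.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SingleWriterSketch {

    // Producers never touch HDFS; they only enqueue records.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();

    // Safe to call from any number of threads.
    public void enqueue(String record) throws InterruptedException {
        queue.put(record);
    }

    // Run this in exactly one thread per output file.
    public void runWriterLoop(String file) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(file);

        // Null key, Text value, block compression -- as suggested in the thread.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path,
                NullWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String record = queue.take();
                writer.append(NullWritable.get(), new Text(record));
            }
        } finally {
            writer.close();
        }
    }
}

With this layout the AlreadyBeingCreatedException should not occur, because no
two clients ever try to create or append to the same path; the trade-off is
that records sit briefly in the in-memory queue before reaching HDFS.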