Hi Sasha,

As you say, HDFS appends are not yet working reliably enough to be
suitable for production use. On the other hand, lots of little files
are bad for the namenode and inefficient for MapReduce (see
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so
it's best to avoid them too.

I would recommend using SequenceFile as a storage container for lots
of small pieces of data. Each key-value pair would represent one of
your little files: the key could be the file name and the value its
contents (you can use a NullWritable key if you only need to store the
contents of each file). You can also enable compression (block
compression is the best choice here, since it compresses many small
records together), and SequenceFiles are designed to work well with
MapReduce.
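
For what it's worth, here's a rough, untested sketch of the idea (the
class name, the Text key / BytesWritable value choice, and reading the
small files from a local directory are just illustrative assumptions):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Illustrative sketch: packs every file in a local directory into one
// block-compressed SequenceFile, keyed by the original file name.
public class SmallFilesToSequenceFile {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path(args[1]); // destination SequenceFile on HDFS

    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, output,
        Text.class, BytesWritable.class, SequenceFile.CompressionType.BLOCK);
    try {
      for (File file : new File(args[0]).listFiles()) {
        byte[] contents = new byte[(int) file.length()];
        FileInputStream in = new FileInputStream(file);
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
          in.close();
        }
        // one small file becomes one key-value record
        writer.append(new Text(file.getName()), new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}

If you don't care about the file names you can use NullWritable.get()
as the key instead, and a MapReduce job can read the result back with
SequenceFileInputFormat, one record per original file.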

Cheers,

Tom

On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy <sasha.do...@gmail.com> wrote:
> Hi there,
> I'm working through a concept at the moment and was attempting to write lots
> of data to a few files, as opposed to writing lots of data to lots of little
> files.  What are the thoughts on this?
>
> When I try to implement outputStream = hdfs.append(path); there doesn't
> seem to be any locking mechanism in place ... or there is one and it doesn't
> work well enough for many writes per second?
>
> I have read that the property "dfs.support.append" is not meant for
> production use.  Still, if millions of little files are as good as, better
> than, or no different from a few massive files, then I suppose append isn't
> something I really need.
>
> I do see a lot of stack traces with messages like:
>
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
> create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on client
> 127.0.0.1 because current leaseholder is trying to recreate file.
>
> I hope this makes sense.  I'm still a little bit confused.
>
> Thanks in advance
> -sd
>
> --
> Sasha Dolgy
> sasha.do...@gmail.com
>
