Hi Tom (or anyone else),

Will SequenceFile allow me to avoid problems with concurrent writes to the
file? I still continue to get the following exceptions/errors in HDFS:
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
client 127.0.0.1 because current leaseholder is trying to recreate file.

This only happens when two processes are trying to write at the same time.
Now, ideally I don't want to buffer the data that's coming in; I want to
get it out and into the file ASAP to avoid any data loss. Am I missing
something here? Is there some sort of factory I can implement to help in
writing a lot of simultaneous data streams?

Thanks in advance for any suggestions,
-sasha

On Wed, May 6, 2009 at 9:40 AM, Tom White <t...@cloudera.com> wrote:
> Hi Sasha,
>
> As you say, HDFS appends are not yet working reliably enough to be
> suitable for production use. On the other hand, having lots of little
> files is bad for the namenode, and inefficient for MapReduce (see
> http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so
> it's best to avoid this too.
>
> I would recommend using SequenceFile as a storage container for lots
> of small pieces of data. Each key-value pair would represent one of
> your little files (you can have a null key, if you only need to store
> the contents of the file). You can also enable compression (use block
> compression), and SequenceFiles are designed to work well with
> MapReduce.
>
> Cheers,
>
> Tom
>
> On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy <sasha.do...@gmail.com> wrote:
> > Hi there,
> >
> > I'm working through a concept at the moment and was attempting to write
> > lots of data to a few files, as opposed to writing lots of data to lots
> > of little files. What are the thoughts on this?
> >
> > When I try to implement outputStream = hdfs.append(path); there doesn't
> > seem to be any locking mechanism in place ... or there is and it doesn't
> > work well enough for many writes per second?
> >
> > I have read and seen that the property "dfs.support.append" is not meant
> > for production use. Still, if millions of little files are as good as,
> > better than, or no different from a few massive files, then I suppose
> > append isn't something I really need.
> >
> > I do see a lot of stack traces with messages like:
> >
> > org.apache.hadoop.ipc.RemoteException:
> > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
> > create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
> > client 127.0.0.1 because current leaseholder is trying to recreate file.
> >
> > I hope this makes sense. Still a little bit confused.
> >
> > Thanks in advance,
> > -sd
> >
> > --
> > Sasha Dolgy
> > sasha.do...@gmail.com
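
For what it's worth, here is a minimal, untested sketch of the pattern Tom's
suggestion implies: many producer threads hand records to an in-memory queue,
and exactly one thread owns the SequenceFile.Writer, since HDFS only allows a
single lease holder (writer) per file at a time. The class name, output path,
and the choice of a NullWritable key with a Text value are assumptions made
for illustration only.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SingleWriterSketch {

    // Producers never touch HDFS; they only enqueue records.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();

    // Safe to call from any number of threads.
    public void enqueue(String record) throws InterruptedException {
        queue.put(record);
    }

    // Run this in exactly one thread per output file.
    public void runWriterLoop(String file) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(file);

        // Null key, Text value, block compression -- as suggested in the thread.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path,
                NullWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String record = queue.take();
                writer.append(NullWritable.get(), new Text(record));
            }
        } finally {
            writer.close();
        }
    }
}

With this layout the AlreadyBeingCreatedException should not occur, because no
two clients ever try to create or append to the same path; the trade-off is
that records sit briefly in the in-memory queue before reaching HDFS.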