Yes, that is the problem: two, or hundreds of, data streams come in very quickly.
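A minimal sketch of the kind of writer this points at, combining Jason's per-path question with Tom's SequenceFile suggestion from the thread below. It assumes the 0.20-era Java API; the class name, stream id, and /foo/bar path are illustrative only, not taken from the thread. Each incoming stream gets its own file, so no two clients ever hold the lease on the same path:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class StreamSink {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One file per writer: the HDFS lease is per-path, so concurrent
    // writers sharing a path is what triggers AlreadyBeingCreatedException.
    String streamId = (args.length > 0) ? args[0] : "stream-0"; // illustrative
    Path path = new Path("/foo/bar/" + streamId + ".seq");

    // Block compression packs many small records together, which is the
    // point of using SequenceFile instead of millions of little files.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, NullWritable.class, Text.class, CompressionType.BLOCK);
    try {
      // Each "little file" becomes one key-value pair; the key can be
      // NullWritable if only the contents matter.
      writer.append(NullWritable.get(), new Text("one small record"));
      writer.append(NullWritable.get(), new Text("another small record"));
    } finally {
      writer.close();
    }
  }
}

Rolling each writer over to a fresh path on a time or size boundary keeps any single file from growing without bound while still avoiding millions of tiny files.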
On Fri, May 8, 2009 at 8:42 PM, jason hadoop <jason.had...@gmail.com> wrote:

> Is it possible that two tasks are trying to write to the same file path?
>
> On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
>
> > Hi Tom (or anyone else),
> > Will SequenceFile allow me to avoid problems with concurrent writes to the
> > file? I still continue to get the following exceptions/errors in hdfs:
> >
> > org.apache.hadoop.ipc.RemoteException:
> > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> > failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528
> > on client 127.0.0.1 because current leaseholder is trying to recreate file.
> >
> > This only happens when two processes try to write at the same time. Ideally
> > I don't want to buffer the data that's coming in; I want to get it out and
> > into the file as soon as possible to avoid any data loss. Am I missing
> > something here? Is there some sort of factory I can implement to help in
> > writing a lot of simultaneous data streams?
> >
> > Thanks in advance for any suggestions
> > -sasha
> >
> > On Wed, May 6, 2009 at 9:40 AM, Tom White <t...@cloudera.com> wrote:
> >
> > > Hi Sasha,
> > >
> > > As you say, HDFS appends are not yet working reliably enough to be
> > > suitable for production use. On the other hand, having lots of little
> > > files is bad for the namenode, and inefficient for MapReduce (see
> > > http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so
> > > it's best to avoid this too.
> > >
> > > I would recommend using SequenceFile as a storage container for lots
> > > of small pieces of data. Each key-value pair would represent one of
> > > your little files (you can have a null key, if you only need to store
> > > the contents of the file). You can also enable compression (use block
> > > compression), and SequenceFiles are designed to work well with
> > > MapReduce.
> > >
> > > Cheers,
> > >
> > > Tom
> > >
> > > On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy <sasha.do...@gmail.com> wrote:
> > > > hi there,
> > > > I'm working through a concept at the moment and was attempting to write
> > > > lots of data to a few files, as opposed to writing lots of data to lots
> > > > of little files. What are the thoughts on this?
> > > >
> > > > When I try to implement outputStream = hdfs.append(path); there doesn't
> > > > seem to be any locking mechanism in place ... or there is, and it doesn't
> > > > work well enough for many writes per second?
> > > >
> > > > I have read and seen that the property "dfs.support.append" is not meant
> > > > for production use. Still, if millions of little files are as good as or
> > > > better than -- or no different from -- a few massive files, then I
> > > > suppose append isn't something I really need.
> > > >
> > > > I do see a lot of stack traces with messages like:
> > > >
> > > > org.apache.hadoop.ipc.RemoteException:
> > > > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
> > > > create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
> > > > client 127.0.0.1 because current leaseholder is trying to recreate file.
> > > >
> > > > I hope this makes sense. Still a little bit confused.
> > > >
> > > > Thanks in advance
> > > > -sd
> > > >
> > > > --
> > > > Sasha Dolgy
> > > > sasha.do...@gmail.com
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals

--
Sasha Dolgy
sasha.do...@gmail.com
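For completeness, the append call Sasha refers to would look roughly like this. It is a sketch only: it assumes dfs.support.append is enabled (which the thread notes is not production-ready in this era of Hadoop), and the class name and path are illustrative:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    Path path = new Path("/foo/bar/events.seq"); // illustrative path

    // HDFS grants a single-writer lease per file: a second process calling
    // append() (or create()) on the same path gets the
    // AlreadyBeingCreatedException shown in the stack traces above.
    FSDataOutputStream out = hdfs.append(path);
    try {
      out.write("one more record\n".getBytes("UTF-8"));
      out.sync(); // flush to datanodes (the API of this era; later hflush/hsync)
    } finally {
      out.close();
    }
  }
}

Even with append working, only one writer can hold the lease on a path at a time, so two processes appending to the same file will still collide; the per-stream files sketched earlier sidestep that.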