2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986 lastFlushOffset 1731
2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39) hdfs.HdfsQueueConsumer: Consumer writing event
2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235 lastFlushOffset 1986
2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 2235
So although the offset is changing as expected, the output stream isn't being flushed or cleared out, and nothing is being written to the file...

On Tue, May 12, 2009 at 5:26 PM, Sasha Dolgy <sdo...@gmail.com> wrote:
> Right now data is received in parallel and written to a queue; a single
> thread then reads the queue and writes those messages to an
> FSDataOutputStream which is kept open, but the messages never get
> flushed. Tried flush() and sync() with no joy.
>
> 1.
>     outputStream.writeBytes(rawMessage.toString());
>
> 2.
>     log.debug("Flushing stream, size = " + s.getOutputStream().size());
>     s.getOutputStream().sync();
>     log.debug("Flushed stream, size = " + s.getOutputStream().size());
>
> or
>
>     log.debug("Flushing stream, size = " + s.getOutputStream().size());
>     s.getOutputStream().flush();
>     log.debug("Flushed stream, size = " + s.getOutputStream().size());
>
> Either way, size() remains the same after performing this action.
>
> This is using hadoop-0.20.0.
>
> -sd
>
> On Sun, May 10, 2009 at 4:45 PM, Stefan Podkowinski <spo...@gmail.com> wrote:
>> You just can't have many distributed jobs write into the same file
>> without locking/synchronizing those writes, even with append(). It's
>> no different from using a regular file from multiple processes in
>> this respect.
>> Maybe you need to collect your data up front, before processing it in
>> Hadoop?
>> Have a look at Chukwa, http://wiki.apache.org/hadoop/Chukwa
>>
>> On Sat, May 9, 2009 at 9:44 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
>> > Would WritableFactories not allow me to open one output stream and
>> > continue to write() and sync()?
>> >
>> > Maybe I'm reading into that wrong. Although UUID would be nice, it
>> > would still leave me with the problem of having lots of little files
>> > instead of a few large files.
>> >
>> > -sd
>> >
>> > On Sat, May 9, 2009 at 8:37 AM, jason hadoop <jason.had...@gmail.com> wrote:
>> >
>> >> You must create unique file names; I don't believe (though I do not
>> >> know) that the append code will allow multiple writers.
>> >>
>> >> Are you writing from within a task, or as an external application
>> >> writing into Hadoop?
>> >>
>> >> You may try using UUID,
>> >> http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part
>> >> of your filename.
>> >> Without knowing more about your goals, environment and constraints,
>> >> it is hard to offer more detailed suggestions.
>> >> You could also have an application aggregate the streams and write
>> >> out chunks, with one or more writers, one per output file.

-- 
Sasha Dolgy
sasha.do...@gmail.com
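[Editor's note on the unchanged size(): FSDataOutputStream inherits size() from java.io.DataOutputStream, where it returns the running count of bytes written to the stream, not the number of bytes still buffered. It therefore grows with every write and never shrinks after a flush or sync, so an unchanged size() does not by itself show that the flush failed. A minimal sketch of the write-and-sync call sequence, with a made-up path and payload (sync() is the 0.20-era call; later releases expose the same idea as hflush()):]

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncSketch {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            // Hypothetical output path, for illustration only.
            FSDataOutputStream out = fs.create(new Path("/tmp/events.log"));

            out.writeBytes("event payload\n");
            // size() is the count of bytes written so far; it does not drop
            // back to zero here, even when the flush succeeds.
            out.sync();   // push buffered bytes toward the datanodes

            // On 0.20.0 the file length is only reliably visible to other
            // readers once the stream is closed.
            out.close();
            fs.close();
        }
    }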
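[Putting jason's UUID suggestion together with the single-consumer queue described in the thread, a rough sketch of the pattern; the class name, queue contents, and directory are made up for illustration:]

    import java.io.IOException;
    import java.util.UUID;
    import java.util.concurrent.BlockingQueue;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // One consumer thread owns the one open stream, so the HDFS file has a
    // single writer; the parallel receivers only ever touch the queue.
    public class HdfsQueueConsumerSketch implements Runnable {
        private final BlockingQueue<String> events;
        private final FSDataOutputStream out;

        public HdfsQueueConsumerSketch(BlockingQueue<String> events, FileSystem fs)
                throws IOException {
            this.events = events;
            // A UUID in the name keeps each consumer's file unique, avoiding
            // multiple writers on one file.
            this.out = fs.create(new Path("/events/" + UUID.randomUUID() + ".log"));
        }

        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    String event = events.take();  // blocks until a producer offers one
                    out.writeBytes(event + "\n");
                    out.sync();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                throw new RuntimeException(e);
            } finally {
                try { out.close(); } catch (IOException ignored) { }
            }
        }
    }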