Hi Todd,

Reading through the JIRA, my impression is that data will only be written out to HDFS once it has reached a certain size in the buffer. Is it possible to define the size of that buffer? Or is this a future enhancement?
-sasha

On Fri, May 15, 2009 at 6:14 PM, Todd Lipcon <t...@cloudera.com> wrote:
> Hi Sasha,
>
> What version are you running? Up until very recent versions, sync() was not
> implemented. Even in the newest releases, sync isn't completely finished,
> and you may find unreliable behavior.
>
> For now, if you need this kind of behavior, your best bet is to close each
> file and then open the next every N minutes. For example, if you're
> processing logs every 5 minutes, simply close log file log.00223 and round
> robin to log.00224 right before you need the data to be available to
> readers. If you're collecting data at a low rate, these files may end up
> being rather small, and you should probably look into doing merges on the
> hour/day/etc to avoid small-file proliferation.
>
> If you want to track the work being done around append and sync, check out
> HADOOP-5744 and the issues referenced therein:
>
> http://issues.apache.org/jira/browse/HADOOP-5744
>
> Hope that helps,
> -Todd
>
> On Fri, May 15, 2009 at 6:35 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
>
> > Hi there, forgive the repost:
> >
> > Right now data is received in parallel and is written to a queue, then a
> > single thread reads the queue and writes those messages to a
> > FSDataOutputStream which is kept open, but the messages never get flushed.
> > Tried flush() and sync() with no joy.
> >
> > 1.
> >
> >     outputStream.writeBytes(rawMessage.toString());
> >
> > 2.
> >
> >     log.debug("Flushing stream, size = " + s.getOutputStream().size());
> >     s.getOutputStream().sync();
> >     log.debug("Flushed stream, size = " + s.getOutputStream().size());
> >
> > or
> >
> >     log.debug("Flushing stream, size = " + s.getOutputStream().size());
> >     s.getOutputStream().flush();
> >     log.debug("Flushed stream, size = " + s.getOutputStream().size());
> >
> > The size() remains the same after performing this action.
> >
> > 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
> > 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
> > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
> > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986 lastFlushOffset 1731
> > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
> > 2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39) hdfs.HdfsQueueConsumer: Consumer writing event
> > 2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
> > 2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
> > 2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
> > 2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235 lastFlushOffset 1986
> > 2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 2235
> >
> > So although the offset is changing as expected, the output stream isn't
> > being flushed or cleared out, and nothing is written to the file unless
> > the stream is close()d ... is this the expected behaviour?
> >
> > -sd

--
Sasha Dolgy
sasha.do...@gmail.com
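[Editor's note] Todd's suggestion above (close each file and round-robin to the next every N minutes so closed files become visible to readers) can be sketched roughly as below. This is a minimal illustration using plain java.io local files rather than Hadoop's FileSystem/FSDataOutputStream API; the class name, the `log.NNNNN` naming scheme, and the rotation trigger are all assumptions for illustration, not code from this thread.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

// Sketch of round-robin log rotation: keep one file open for writing,
// and periodically close it (making its contents readable/durable) and
// open the next file in the sequence. In the HDFS case the writes would
// go to an FSDataOutputStream instead of a local FileWriter.
public class RotatingLogWriter {
    private final String prefix;      // assumed scheme: prefix + 5-digit index, e.g. log.00223
    private int index = 0;
    private BufferedWriter current;

    public RotatingLogWriter(String prefix) throws IOException {
        this.prefix = prefix;
        this.current = open();
    }

    private String nameFor(int i) {
        return String.format("%s%05d", prefix, i);
    }

    private BufferedWriter open() throws IOException {
        return new BufferedWriter(new FileWriter(nameFor(index)));
    }

    public void write(String line) throws IOException {
        current.write(line);
        current.newLine();
    }

    // Call this on a timer (e.g. every 5 minutes, per Todd's example):
    // closing the current file makes its data available to readers,
    // then we round-robin to the next file. Returns the closed file's name.
    public String rotate() throws IOException {
        current.close();
        String closedName = nameFor(index);
        index++;
        current = open();
        return closedName;
    }

    public void close() throws IOException {
        current.close();
    }
}
```

As Todd notes, if the data rate is low this produces many small files, so a periodic hourly/daily merge job would be needed alongside the rotation.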