Hi Todd,
Reading through the JIRA, my impression is that data will be written out to
HDFS only once it has reached a certain size in the buffer. Is it possible
to define the size of that buffer? Or is this a future enhancement?

-sasha

On Fri, May 15, 2009 at 6:14 PM, Todd Lipcon <t...@cloudera.com> wrote:

> Hi Sasha,
>
> What version are you running? Up until very recent versions, sync() was not
> implemented. Even in the newest releases, sync isn't completely finished,
> and you may find unreliable behavior.
>
> For now, if you need this kind of behavior, your best bet is to close each
> file and then open the next every N minutes. For example, if you're
> processing logs every 5 minutes, simply close log file log.00223 and round
> robin to log.00224 right before you need the data to be available to
> readers. If you're collecting data at a low rate, these files may end up
> being rather small, and you should probably look into doing merges on the
> hour/day/etc to avoid small-file proliferation.
>
> If you want to track the work being done around append and sync, check out
> HADOOP-5744 and the issues referenced therein:
>
> http://issues.apache.org/jira/browse/HADOOP-5744
>
> Hope that helps,
> -Todd
>
> On Fri, May 15, 2009 at 6:35 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
>
> > Hi there, forgive the repost:
> >
> > Right now data is received in parallel and written to a queue; a single
> > thread then reads the queue and writes those messages to an
> > FSDataOutputStream which is kept open, but the messages never get
> flushed.
> >  Tried flush() and sync(), with no joy.
> >
> > 1.
> > outputStream.writeBytes(rawMessage.toString());
> >
> > 2.
> >
> > log.debug("Flushing stream, size = " + s.getOutputStream().size());
> > s.getOutputStream().sync();
> > log.debug("Flushed stream, size = " + s.getOutputStream().size());
> >
> > or
> >
> > log.debug("Flushing stream, size = " + s.getOutputStream().size());
> > s.getOutputStream().flush();
> > log.debug("Flushed stream, size = " + s.getOutputStream().size());
> >
> > The size() remains the same after performing this action.
> >
> > 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28)
> > hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
> > 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49)
> > hdfs.HdfsQueueConsumer: Re-using existing stream
> > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63)
> > hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
> > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013)
> > hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986
> > lastFlushOffset 1731
> > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66)
> > hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
> > 2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39)
> > hdfs.HdfsQueueConsumer: Consumer writing event
> > 2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28)
> > hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
> > 2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49)
> > hdfs.HdfsQueueConsumer: Re-using existing stream
> > 2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63)
> > hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
> > 2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013)
> > hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235
> > lastFlushOffset 1986
> > 2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66)
> > hdfs.HdfsQueueConsumer: Flushed stream, size = 2235
> >
> > So although the offset is changing as expected, the output stream isn't
> > being flushed or cleared out, and the data isn't written to the file until
> > close() is called ... is this the expected behaviour?
> >
> > -sd
> >
>



-- 
Sasha Dolgy
sasha.do...@gmail.com
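
[Editor's note: the close-and-roll pattern Todd suggests can be sketched roughly as below. This is a minimal illustration, not code from the thread: the class and file names are made up, and it uses plain java.io.FileWriter for brevity; against HDFS you would obtain the stream from FileSystem.create() instead, but the rotation logic is the same.]

```java
import java.io.FileWriter;
import java.io.IOException;

// Sketch of the close-and-roll approach: keep one output file open,
// and every N milliseconds close it (making its contents visible to
// readers) and round-robin to the next sequence number.
public class RollingLogWriter {
    private final String prefix;
    private final long rollIntervalMs;
    private long lastRollTime;
    private int sequence;
    private FileWriter current;

    public RollingLogWriter(String prefix, long rollIntervalMs) {
        this.prefix = prefix;
        this.rollIntervalMs = rollIntervalMs;
    }

    // log.00223-style names, zero-padded so they sort lexically.
    public static String fileName(String prefix, int sequence) {
        return String.format("%s.%05d", prefix, sequence);
    }

    public synchronized void write(String message) throws IOException {
        long now = System.currentTimeMillis();
        if (current == null || now - lastRollTime >= rollIntervalMs) {
            roll(now);
        }
        current.write(message);
    }

    // Close the previous file and open the next one in the sequence.
    private void roll(long now) throws IOException {
        if (current != null) {
            current.close();
        }
        current = new FileWriter(fileName(prefix, sequence++));
        lastRollTime = now;
    }

    public synchronized void close() throws IOException {
        if (current != null) {
            current.close();
        }
    }
}
```

As Todd notes, small-file proliferation is the trade-off: with a low data rate and a short roll interval, a periodic merge job (hourly/daily) keeps the file count manageable.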
