Hi Jason,

If the bufferSize is set when the stream is created, will the stream
automatically write its buffered contents out to HDFS once that size is
reached? And what happens when the buffer size is exceeded?

-sasha
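(For illustration only: a minimal sketch of passing an explicit bufferSize
through the create() overload quoted below. The path, replication factor and
block size here are invented for the example, not taken from this thread.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.permission.FsPermission;

  public class CreateWithBufferSize {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      FSDataOutputStream out = fs.create(
          new Path("/logs/events.log"),              // hypothetical path
          FsPermission.getDefault(),
          true,                                      // overwrite if present
          conf.getInt("io.file.buffer.size", 4096),  // client-side buffer size
          (short) 3,                                 // replication
          64L * 1024 * 1024,                         // 64 MB block size
          null);                                     // no progress callback
      out.writeBytes("example event\n");
      out.close();  // on pre-append HDFS, data becomes visible on close
    }
  }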
On Mon, May 18, 2009 at 3:04 AM, jason hadoop <jason.had...@gmail.com> wrote:

> When you open a file you have the option, blockSize
>
>   /**
>    * Opens an FSDataOutputStream at the indicated Path with write-progress
>    * reporting.
>    * @param f the file name to open
>    * @param permission
>    * @param overwrite if a file with this name already exists, then if true,
>    *   the file will be overwritten, and if false an error will be thrown.
>    * @param bufferSize the size of the buffer to be used.
>    * @param replication required block replication for the file.
>    * @param blockSize
>    * @param progress
>    * @throws IOException
>    * @see #setPermission(Path, FsPermission)
>    */
>   public abstract FSDataOutputStream create(Path f,
>       FsPermission permission,
>       boolean overwrite,
>       int bufferSize,
>       short replication,
>       long blockSize,
>       Progressable progress) throws IOException;
>
> On Fri, May 15, 2009 at 12:44 PM, Sasha Dolgy <sdo...@gmail.com> wrote:
>
>> Hi Todd,
>> Reading through the JIRA, my impression is that data will be written out
>> to HDFS only once it has reached a certain size in the buffer. Is it
>> possible to define the size of that buffer? Or is this a future
>> enhancement?
>>
>> -sasha
>>
>> On Fri, May 15, 2009 at 6:14 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>> > Hi Sasha,
>> >
>> > What version are you running? Up until very recent versions, sync() was
>> > not implemented. Even in the newest releases, sync isn't completely
>> > finished, and you may find unreliable behavior.
>> >
>> > For now, if you need this kind of behavior, your best bet is to close
>> > each file and then open the next every N minutes. For example, if
>> > you're processing logs every 5 minutes, simply close log file log.00223
>> > and round robin to log.00224 right before you need the data to be
>> > available to readers. If you're collecting data at a low rate, these
>> > files may end up being rather small, and you should probably look into
>> > doing merges on the hour/day/etc to avoid small-file proliferation.
>> >
>> > If you want to track the work being done around append and sync, check
>> > out HADOOP-5744 and the issues referenced therein:
>> >
>> > http://issues.apache.org/jira/browse/HADOOP-5744
>> >
>> > Hope that helps,
>> > -Todd
>> >
>> > On Fri, May 15, 2009 at 6:35 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
>> >
>> > > Hi there, forgive the repost:
>> > >
>> > > Right now data is received in parallel and is written to a queue,
>> > > then a single thread reads the queue and writes those messages to a
>> > > FSDataOutputStream which is kept open, but the messages never get
>> > > flushed. Tried flush() and sync() with no joy.
>> > >
>> > > 1.
>> > >   outputStream.writeBytes(rawMessage.toString());
>> > >
>> > > 2.
>> > >   log.debug("Flushing stream, size = " + s.getOutputStream().size());
>> > >   s.getOutputStream().sync();
>> > >   log.debug("Flushed stream, size = " + s.getOutputStream().size());
>> > >
>> > > or
>> > >
>> > >   log.debug("Flushing stream, size = " + s.getOutputStream().size());
>> > >   s.getOutputStream().flush();
>> > >   log.debug("Flushed stream, size = " + s.getOutputStream().size());
>> > >
>> > > The size() remains the same after performing this action.
>> > >
>> > > 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28)
>> > > hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
>> > > 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49)
>> > > hdfs.HdfsQueueConsumer: Re-using existing stream
>> > > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63)
>> > > hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
>> > > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013)
>> > > hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986
>> > > lastFlushOffset 1731
>> > > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66)
>> > > hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
>> > > 2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39)
>> > > hdfs.HdfsQueueConsumer: Consumer writing event
>> > > 2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28)
>> > > hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
>> > > 2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49)
>> > > hdfs.HdfsQueueConsumer: Re-using existing stream
>> > > 2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63)
>> > > hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
>> > > 2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013)
>> > > hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235
>> > > lastFlushOffset 1986
>> > > 2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66)
>> > > hdfs.HdfsQueueConsumer: Flushed stream, size = 2235
>> > >
>> > > So although the offset is changing as expected, the output stream
>> > > isn't being flushed or cleared out, and nothing is written to the
>> > > file unless the stream is closed ... is this the expected behaviour?
>> > >
>> > > -sd
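(Again for illustration: a rough sketch of the close-and-rotate approach Todd
suggests above. The 5-minute interval, the /logs path and the naming scheme
are invented; treat this as one possible shape, not code from the thread.)

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RollingHdfsWriter {
    private final FileSystem fs;
    private final long rollIntervalMs = 5 * 60 * 1000L;  // e.g. 5 minutes
    private long lastRoll = System.currentTimeMillis();
    private int sequence = 0;
    private FSDataOutputStream out;

    public RollingHdfsWriter(Configuration conf) throws IOException {
      fs = FileSystem.get(conf);
      out = openNext();
    }

    // Round-robin naming: log.00000, log.00001, ... (hypothetical scheme)
    private FSDataOutputStream openNext() throws IOException {
      return fs.create(new Path(String.format("/logs/log.%05d", sequence++)));
    }

    public synchronized void write(String message) throws IOException {
      long now = System.currentTimeMillis();
      if (now - lastRoll >= rollIntervalMs) {
        out.close();   // closing is what makes the data visible to readers
        out = openNext();
        lastRoll = now;
      }
      out.writeBytes(message);
    }

    public synchronized void close() throws IOException {
      out.close();
    }
  }

Since sync() isn't reliably implemented in the versions discussed in this
thread, closing and rotating files on an interval is the workable way to make
writes visible to readers; the resulting small files can then be merged
hourly/daily as Todd notes.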