When you open a file you have the option of specifying the block size via the blockSize parameter of create():
  /**
   * Opens an FSDataOutputStream at the indicated Path with write-progress
   * reporting.
   * @param f the file name to open
   * @param permission the permission to set on the file
   * @param overwrite if a file with this name already exists, then if true,
   *   the file will be overwritten, and if false an error will be thrown.
   * @param bufferSize the size of the buffer to be used.
   * @param replication required block replication for the file.
   * @param blockSize the block size to use for the file.
   * @param progress callback used to report progress of the write.
   * @throws IOException if the file cannot be created
   * @see #setPermission(Path, FsPermission)
   */
  public abstract FSDataOutputStream create(Path f,
      FsPermission permission,
      boolean overwrite,
      int bufferSize,
      short replication,
      long blockSize,
      Progressable progress) throws IOException;
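
For illustration, a call to this overload might look like the sketch below. This assumes a configured Hadoop client on the classpath; the path, replication factor, and 64 MB block size are made-up example values, not taken from this thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        // Assumes fs.default.name / core-site.xml point at your cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FSDataOutputStream out = fs.create(
                new Path("/logs/log.00223"),   // f: illustrative path
                FsPermission.getDefault(),     // permission
                true,                          // overwrite
                4096,                          // bufferSize
                (short) 3,                     // replication
                64L * 1024 * 1024,             // blockSize: 64 MB, example only
                new Progressable() {           // progress callback
                    public void progress() { /* report progress */ }
                });
        out.writeBytes("event data\n");
        out.close();   // on these versions, data is only visible after close
    }
}
```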

On Fri, May 15, 2009 at 12:44 PM, Sasha Dolgy <sdo...@gmail.com> wrote:

> Hi Todd,
> Reading through the JIRA, my impression is that data will be written out to
> hdfs only once it has reached a certain size in the buffer.  Is it possible
> to define the size of that buffer?  Or is this a future enhancement?
>
> -sasha
>
> On Fri, May 15, 2009 at 6:14 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
> > Hi Sasha,
> >
> > What version are you running? Up until very recent versions, sync() was
> not
> > implemented. Even in the newest releases, sync isn't completely finished,
> > and you may find unreliable behavior.
> >
> > For now, if you need this kind of behavior, your best bet is to close
> each
> > file and then open the next every N minutes. For example, if you're
> > processing logs every 5 minutes, simply close log file log.00223 and
> round
> > robin to log.00224 right before you need the data to be available to
> > readers. If you're collecting data at a low rate, these files may end up
> > being rather small, and you should probably look into doing merges on the
> > hour/day/etc to avoid small-file proliferation.
> >
> > If you want to track the work being done around append and sync, check
> out
> > HADOOP-5744 and the issues referenced therein:
> >
> > http://issues.apache.org/jira/browse/HADOOP-5744
> >
> > Hope that helps,
> > -Todd
> >
> > On Fri, May 15, 2009 at 6:35 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
> >
> > > Hi there, forgive the repost:
> > >
> > > Right now data is received in parallel and is written to a queue, then
> a
> > > single thread reads the queue and writes those messages to a
> > > FSDataOutputStream which is kept open, but the messages never get
> > flushed.
> > >  Tried flush() and sync() with no joy.
> > >
> > > 1.
> > > outputStream.writeBytes(rawMessage.toString());
> > >
> > > 2.
> > >
> > > log.debug("Flushing stream, size = " + s.getOutputStream().size());
> > > s.getOutputStream().sync();
> > > log.debug("Flushed stream, size = " + s.getOutputStream().size());
> > >
> > > or
> > >
> > > log.debug("Flushing stream, size = " + s.getOutputStream().size());
> > > s.getOutputStream().flush();
> > > log.debug("Flushed stream, size = " + s.getOutputStream().size());
> > >
> > > The size() remains the same after performing this action.
> > >
> > > 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28)
> > > hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
> > > 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49)
> > > hdfs.HdfsQueueConsumer: Re-using existing stream
> > > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63)
> > > hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
> > > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013)
> > > hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986
> > > lastFlushOffset 1731
> > > 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66)
> > > hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
> > > 2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39)
> > > hdfs.HdfsQueueConsumer: Consumer writing event
> > > 2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28)
> > > hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
> > > 2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49)
> > > hdfs.HdfsQueueConsumer: Re-using existing stream
> > > 2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63)
> > > hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
> > > 2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013)
> > > hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235
> > > lastFlushOffset 1986
> > > 2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66)
> > > hdfs.HdfsQueueConsumer: Flushed stream, size = 2235
> > >
> > > So although the offset is changing as expected, the output stream isn't
> > > being flushed or cleared out, and nothing is written to the file unless
> > > the stream is close()d ... is this the expected behaviour?
> > >
> > > -sd
> > >
> >
>
>
>
> --
> Sasha Dolgy
> sasha.do...@gmail.com
>
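
Todd's close-and-round-robin suggestion above can be sketched generically. This is a hedged illustration using plain java.io rather than the HDFS API (the rotation logic is the same either way); the RollingWriter class and the log.00223 naming are borrowed from his example, not from any Hadoop utility:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Sketch of the "close and round-robin" pattern: write to log.NNNNN, and
// when a rotation is due, close the current file (making it visible to
// readers) and open the next sequence number.
public class RollingWriter {
    private final File dir;
    private int seq;
    private FileWriter current;

    public RollingWriter(File dir, int firstSeq) throws IOException {
        this.dir = dir;
        this.seq = firstSeq;
        this.current = open(seq);
    }

    static String name(int seq) {
        return String.format("log.%05d", seq);   // e.g. log.00223
    }

    private FileWriter open(int n) throws IOException {
        return new FileWriter(new File(dir, name(n)));
    }

    public void write(String record) throws IOException {
        current.write(record);
    }

    // Close the current file and move on to the next one. In the HDFS
    // case this is the point at which readers can see the data.
    public void rotate() throws IOException {
        current.close();
        seq++;
        current = open(seq);
    }

    public void close() throws IOException {
        current.close();
    }

    public String currentName() {
        return name(seq);
    }
}
```

A periodic task (e.g. every 5 minutes, per Todd's example) would call rotate(), and a separate merge job could later concatenate the small files.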



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
