In Hadoop 1.0 (from 0.20-append onwards), there is just a single sync(…) output-stream call. It performs a metadata update to persist the data already written to the under-construction block, and flushes the open block file at the DNs (but does *not* flush the file descriptor at the OS level, i.e. no fsync: http://linux.die.net/man/2/fsync).
In Hadoop 2.0 (what is 0.23.x today), there are two APIs: hflush and hsync. The former is akin to the old sync(…) call above, while the latter is designed to go one step further and invoke the fsync syscall (http://linux.die.net/man/2/fsync) to ensure the data is really persisted to disk. However, as of 0.23.2 at least, hsync() isn't completely implemented and simply calls hflush() instead, so the behavior is currently the same.

See https://issues.apache.org/jira/browse/HDFS-265 for all the discussion around this change between 1.0 and 2.0. Also see the javadocs for both calls:
http://hadoop.apache.org/common/docs/r0.23.1/api/org/apache/hadoop/fs/FSDataOutputStream.html#hflush()
http://hadoop.apache.org/common/docs/r0.23.1/api/org/apache/hadoop/fs/FSDataOutputStream.html#hsync()

https://issues.apache.org/jira/browse/HDFS-744 tracks completion of the hsync() feature.

On Thu, Apr 12, 2012 at 3:59 PM, Inder Pall <inder.p...@gmail.com> wrote:
> Folks,
>
> Can some one shed out more technical details than what the javadoc talks
> about.
> Also, which one should be used when?
>
> --
> Thanks,
> - Inder
> Tech Platforms @Inmobi
> Linkedin - http://goo.gl/eR4Ub

--
Harsh J
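To make the distinction concrete, here is a minimal sketch of the two 2.0 calls on FSDataOutputStream. It assumes a reachable HDFS cluster (or a local filesystem default) with the Hadoop jars on the classpath; the path and record contents are purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/flush-demo.log"));

    out.writeBytes("record 1\n");
    // hflush(): pushes data out of client buffers so it is visible to
    // new readers, but does not guarantee it has hit disk on the DNs
    // (no OS-level fsync).
    out.hflush();

    out.writeBytes("record 2\n");
    // hsync(): intended to additionally fsync the block files on the
    // DNs. As of 0.23.2 it still just calls hflush() (see HDFS-744),
    // so today the durability guarantee is the same as hflush().
    out.hsync();

    out.close();
    fs.close();
  }
}
```

In 1.0-based releases, the equivalent of the hflush() call above is the single sync() method.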