Kevin Conaway created FLUME-2922:
------------------------------------

             Summary: HDFSSequenceFile Should Sync Writer
                 Key: FLUME-2922
                 URL: https://issues.apache.org/jira/browse/FLUME-2922
             Project: Flume
          Issue Type: Bug
          Components: Sinks+Sources
    Affects Versions: v1.6.0
            Reporter: Kevin Conaway
            Priority: Critical


There is a possibility of losing data with the current HDFS sequence file 
writer.

Internally, the `SequenceFile.Writer` buffers data and periodically syncs it to 
the underlying output stream.  The exact mechanism depends on whether 
compression is enabled but, in both scenarios, the key/values are appended to 
an internal buffer and only flushed to disk after the buffer reaches a certain 
size.
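
To make the failure mode concrete, here is a minimal sketch (not Flume code; 
the path, key/value types, and payload are made up for illustration) showing 
that flushing the stream alone does not drain the writer's buffer:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BufferedAppendDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream outStream = fs.create(new Path("/tmp/events.seq"));

    // Block compression makes the buffering explicit: appended key/values
    // accumulate in the writer's in-memory buffers until a block fills.
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.stream(outStream),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));

    writer.append(new LongWritable(1L), new Text("event"));

    // Flushing the stream does NOT drain the writer's buffer, so the record
    // above is still only in this process's memory and dies with it.
    outStream.hflush();

    writer.close();
  }
}
```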

Thus it is quite possible for Flume to lose messages if the agent crashes or 
is stopped before the internal buffer is flushed to disk.

The correct fix is to force the writer to sync its internal buffers to the 
underlying `FSDataOutputStream` before calling hflush/sync.
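
A minimal sketch of that ordering, assuming `writer` and `outStream` fields 
like those in `HDFSSequenceFile` (simplified here; note that 
`SequenceFile.Writer.sync()` writes a sync marker and, for the block-compressed 
writer, also flushes the buffered block down to the stream):

```java
@Override
public void sync() throws IOException {
  // Drain the SequenceFile.Writer's internal buffers down to the
  // underlying FSDataOutputStream first...
  writer.sync();
  // ...then flush the stream itself out to the datanodes
  // (or hsync it, per the next paragraph).
  outStream.hflush();
}
```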

Additionally, I believe we should be calling hsync instead of hflush.  My 
understanding is that hsync offers stronger durability guarantees, which I 
believe are the semantics we want here.
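
For reference, the distinction this relies on, per the `Syncable` contract in 
HDFS (using the same hypothetical `outStream` as above):

```java
// hflush(): flushes client-side buffers so the data reaches the datanodes and
// becomes visible to new readers, but it may still sit in datanode memory and
// can be lost if the datanodes lose power.
outStream.hflush();

// hsync(): everything hflush() does, plus each datanode persists the data to
// its local disk, which is the stronger guarantee argued for above.
outStream.hsync();
```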



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
