[ https://issues.apache.org/jira/browse/FLUME-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323307#comment-15323307 ]
Kevin Conaway commented on FLUME-2922:
--------------------------------------

Attaching patch via https://patch-diff.githubusercontent.com/raw/apache/flume/pull/52.patch

> HDFSSequenceFile Should Sync Writer
> -----------------------------------
>
>                 Key: FLUME-2922
>                 URL: https://issues.apache.org/jira/browse/FLUME-2922
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.6.0
>            Reporter: Kevin Conaway
>            Priority: Critical
>         Attachments: FLUME-2922.patch
>
>
> There is a possibility of losing data with the current HDFS sequence file writer.
> Internally, the `SequenceFile.Writer` buffers data and periodically syncs it to the underlying output stream. The mechanism for doing so depends on whether compression is enabled, but in both cases key/value pairs are appended to an internal buffer and only flushed to disk once that buffer reaches a certain size.
> It is therefore quite possible for Flume to lose messages if the agent crashes, or is stopped, before the internal buffer has been flushed to disk.
> The correct fix is to force the writer to sync its internal buffers to the underlying `FSDataOutputStream` before calling hflush/sync.
> Additionally, I believe we should be calling hsync instead of hflush. It is my understanding that writes with hsync are more durable, which I believe is the semantics we want here.
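To make the buffering behaviour described above concrete, here is a small standalone demo (a sketch for illustration, not part of the attached patch) that appends block-compressed records on the local filesystem and shows that nothing reaches the underlying stream until the writer is synced. The file path and record contents are arbitrary, and the deprecated `createWriter` overload is used only for brevity:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

// Demo of the buffering behaviour described in the report: records
// appended to a block-compressed SequenceFile.Writer accumulate in an
// internal buffer and are not emitted to the stream until the pending
// block is written out, e.g. by writer.sync().
public class BufferingDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("/tmp/buffering-demo.seq"); // arbitrary path

    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
        LongWritable.class, Text.class, CompressionType.BLOCK);
    for (long i = 0; i < 100; i++) {
      writer.append(new LongWritable(i), new Text("event-" + i));
    }

    long before = writer.getLength(); // bytes actually emitted so far
    writer.sync();                    // force the pending block out
    long after = writer.getLength();

    // "before" stays near the header size; "after" includes the block.
    System.out.println("before sync: " + before + ", after sync: " + after);
    writer.close();
  }
}
```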
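A minimal sketch of the shape of the change being described, assuming the sink holds references to both the `SequenceFile.Writer` and the `FSDataOutputStream` it writes to, as Flume's `HDFSSequenceFile` does. The class and field names here are illustrative; this is not the attached patch:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.SequenceFile;

// Illustrative sketch of the described fix; not the attached patch.
class SyncedSequenceFileSink {
  private final SequenceFile.Writer writer;
  private final FSDataOutputStream outStream;

  SyncedSequenceFileSink(SequenceFile.Writer writer,
      FSDataOutputStream outStream) {
    this.writer = writer;
    this.outStream = outStream;
  }

  void sync() throws IOException {
    // Step 1: drain the writer's internal buffer into the stream.
    // For a block-compressed writer this writes out the pending
    // block; without it, step 2 only persists bytes the writer has
    // already emitted.
    writer.sync();
    // Step 2: push the stream's bytes to the datanodes. hsync is
    // used here (rather than hflush) for the stronger durability
    // argued for above.
    outStream.hsync();
  }
}
```

Note that hsync is more expensive than hflush, so calling it on every batch can reduce throughput; that trade-off is why hflush is often the default choice.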
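On the hflush-versus-hsync point, the distinction in the Hadoop 2.x `Syncable` API is roughly as follows. The boolean switch in this helper is invented for the sketch and is not a Flume or Hadoop option:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;

// Illustration of the hflush/hsync distinction; the boolean switch
// is invented for this sketch, not a real configuration flag.
final class FlushVsSync {
  private FlushVsSync() {}

  static void persist(FSDataOutputStream out, boolean requireDiskDurability)
      throws IOException {
    if (requireDiskDurability) {
      // hsync: like POSIX fsync; asks each datanode to flush the
      // block data to disk, so it survives datanode restarts.
      out.hsync();
    } else {
      // hflush: guarantees the bytes have left the client and are
      // visible to new readers, but they may still sit only in
      // datanode memory.
      out.hflush();
    }
  }
}
```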