Yoel Cabo Lopez created STORM-2218:
--------------------------------------

             Summary: When using Block Compression in the SequenceFileBolt some 
Tuples may be acked before the data is flushed to HDFS
                 Key: STORM-2218
                 URL: https://issues.apache.org/jira/browse/STORM-2218
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-hdfs
            Reporter: Yoel Cabo Lopez
            Priority: Minor


In AbstractHdfsBolt, tuples are acked after calling syncAllWriters(), which 
ends up calling doSync() on every writer. In the case of SequenceFileWriter, 
doSync() amounts to calling the hsync() method of SequenceFile.Writer:

https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/common/SequenceFileWriter.java#L52
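For reference, a minimal sketch of the current behavior (class structure 
abbreviated and renamed for illustration; see the linked source for the real 
code):

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.SequenceFile;

// Abbreviated sketch of what SequenceFileWriter does today.
class SequenceFileWriterSketch {
    private SequenceFile.Writer writer;

    void doSync() throws IOException {
        // hsync() flushes the file's output stream to the DataNodes, but it
        // does not force out a partially filled compression block, so records
        // buffered in that block are not yet durable.
        writer.hsync();
    }
}
{code}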

The problem with block compression is that an open compression block is not 
flushed by hsync(). Instead it is necessary to call the sync() method, which 
adds a sync marker, compresses the block, and writes it to the output stream 
that hsync() then flushes. This also happens automatically when the 
compression block reaches a certain size, but we cannot be certain the data 
has been flushed until we call sync() followed by hsync():

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/SequenceFile.java#L1549
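In other words, for a block-compressed writer the reliable flush sequence 
looks like this (a sketch, assuming writer is an open block-compressed 
SequenceFile.Writer):

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.SequenceFile;

class FlushSketch {
    // Reliable flush for a block-compressed SequenceFile.Writer.
    static void flushReliably(SequenceFile.Writer writer) throws IOException {
        writer.sync();   // end the current compression block: write the sync
                         // marker, compress the buffered records, and append
                         // the compressed block to the file's output stream
        writer.hsync();  // flush the output stream to the DataNodes so the
                         // block survives a client crash
    }
}
{code}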

The easy fix is to add a call to sync() when the writer is using block 
compression. I'm concerned about the impact that would have on the block 
size, but I think it is the only way of writing the data reliably in this 
case.
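A possible shape for that fix, as a sketch only (the compressionType field 
here is a hypothetical stand-in for however the writer records the 
SequenceFile.CompressionType it was created with):

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.SequenceFile;

class FixedSequenceFileWriterSketch {
    private SequenceFile.Writer writer;
    // Hypothetical field: the compression type the writer was created with.
    private SequenceFile.CompressionType compressionType;

    void doSync() throws IOException {
        if (compressionType == SequenceFile.CompressionType.BLOCK) {
            // Close the open compression block so that hsync() actually
            // makes the buffered records durable. Note this ends the block
            // early, which may shrink the average compressed block size.
            writer.sync();
        }
        writer.hsync();
    }
}
{code}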



