[ https://issues.apache.org/jira/browse/KAFKA-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Rao updated KAFKA-573:
--------------------------

    Attachment: kafka-573.patch

Attached a patch. There are 2 problems.

The first one is the more severe. We recently changed FileMessageSet to remove the mutable flag. As a result, every time a new FileMessageSet is created, the constructor sets the file channel's position to the end of the file. While a file channel is being appended to with newly produced data, its position can therefore be moved by a FileMessageSet created for a fetch request. Because the append and the position change are not synchronized, a message in the log is occasionally overwritten.

The second issue is in ByteBufferMessageSet.writeTo. We try to reset the buffer position after writing the data in the buffer to the channel. However, there is no guarantee that the whole buffer is written to the channel in a single write, so resetting the buffer position after a partial write can cause incorrect bytes to be written to the channel.

The patch fixes both issues. The changes are:
(1) Added a new flag to the constructor of FileMessageSet to control whether the channel position is set to the end of the file.
(2) Changed ByteBufferMessageSet.writeTo so that we wait until the whole buffer is written to the channel before resetting the buffer position.
(3) Added some logging that I found useful while investigating the issues.

The system test passes now.

> System Test : Leader Failure Log Segment Checksum Mismatched When request-num-acks is 1
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-573
>                 URL: https://issues.apache.org/jira/browse/KAFKA-573
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: John Fung
>             Fix For: 0.8
>
>         Attachments: acks1_leader_failure_data_loss.tar.gz, kafka-573.patch, kafka-573-reproduce-issue.patch
>
>
> • Test Description:
>    1. Start a 3-broker cluster as source
>    2. Send messages to source cluster
>    3. Find leader and terminate it (kill -15)
>    4. Start the broker again
>    5. Start a consumer to consume data
>    6. Compare the MessageID in the data between producer log and consumer log.
> • Issue: There will be data loss if request-num-acks is set to 1.
> • To reproduce this issue, do the following:
>    1. Download the latest 0.8 branch
>    2. Apply the patch attached to this JIRA
>    3. Build kafka by running "./sbt update package"
>    4. Execute the test in directory "system_test" : "python -B system_test_runner.py"
>    5. This test will execute testcase_2 with the following settings:
>        Replica factor : 3
>        No. of partitions : 1
>        No. of bouncing : 1
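
For readers who want to see the second problem from the comment above in isolation (resetting the buffer position after a write that may have been partial), here is a minimal Java sketch. It is not the code from kafka-573.patch and does not use any Kafka class. WriteFullyDemo, TrickleChannel, writeOnceThenReset, writeFully and run are hypothetical names made up for this illustration, and the 4-byte limit per write is an assumption chosen only to force a partial write.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

public class WriteFullyDemo {

    /** Hypothetical test channel: accepts at most maxPerWrite bytes per call. */
    static final class TrickleChannel implements WritableByteChannel {
        private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        private final int maxPerWrite;

        TrickleChannel(int maxPerWrite) { this.maxPerWrite = maxPerWrite; }

        @Override
        public int write(ByteBuffer src) {
            int n = Math.min(maxPerWrite, src.remaining());
            byte[] chunk = new byte[n];
            src.get(chunk);            // advances src's position by n
            sink.write(chunk, 0, n);
            return n;
        }

        @Override public boolean isOpen() { return true; }
        @Override public void close() { }

        String contents() {
            return new String(sink.toByteArray(), StandardCharsets.US_ASCII);
        }
    }

    /** Problematic pattern: one write attempt, then the position is reset. */
    static int writeOnceThenReset(ByteBuffer buf, WritableByteChannel ch) throws IOException {
        buf.mark();
        int written = ch.write(buf);
        buf.reset();                   // rewinds even if only part of buf was written
        return written;
    }

    /** Safer pattern: drain the whole buffer first, then reset the position. */
    static int writeFully(ByteBuffer buf, WritableByteChannel ch) throws IOException {
        buf.mark();
        int written = 0;
        while (buf.hasRemaining()) {
            written += ch.write(buf);
        }
        buf.reset();                   // buffer can be re-read from its original position
        return written;
    }

    /** Writes "abcdef" through a channel that takes 4 bytes at a time. */
    static String run(boolean fixed) {
        ByteBuffer buf = ByteBuffer.wrap("abcdef".getBytes(StandardCharsets.US_ASCII));
        TrickleChannel ch = new TrickleChannel(4);
        try {
            if (fixed) {
                writeFully(buf, ch);
            } else {
                // Caller retries until enough bytes are reported as written.
                int total = 0;
                while (total < buf.limit()) {
                    total += writeOnceThenReset(buf, ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ch.contents();
    }

    public static void main(String[] args) {
        System.out.println("reset after each write: " + run(false)); // abcdabcd
        System.out.println("reset after full write: " + run(true));  // abcdef
    }
}
```

With the early reset, the retry sends the first four bytes a second time, so the channel receives "abcdabcd" instead of "abcdef". The loop in writeFully assumes a blocking channel. On a non-blocking channel a write can return 0, and the loop would spin until the channel accepts more data.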