[jira] [Created] (KAFKA-4384) ReplicaFetcherThread stopped after ReplicaFetcherThread received a corrupted message

Jun He (JIRA) Sat, 05 Nov 2016 12:02:07 -0700

Jun He created KAFKA-4384:
-----------------------------

             Summary: ReplicaFetcherThread stopped after ReplicaFetcherThread 
received a corrupted message
                 Key: KAFKA-4384
                 URL: https://issues.apache.org/jira/browse/KAFKA-4384
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 0.10.0.1, 0.10.0.0, 0.9.0.1
         Environment: Ubuntu 12.04, AWS D2 instance
            Reporter: Jun He



We recently discovered an issue in Kafka 0.9.0.1 (), where ReplicaFetcherThread 
stopped after ReplicaFetcherThread received a corrupted message. As the same 
logic exists also in Kafka 0.10.0.0 and 0.10.0.1, they may have the similar 
issue.

Here are system logs related to this issue.
2016-10-30 - 02:33:05,606 ERROR ReplicaFetcherThread-5-1590174474 
ReplicaFetcherThread.apply - Found invalid messages during fetch for partition 
[logs,41] offset 39021512238 error Message is corrupt (stored crc = 2028421553, 
computed crc = 3577227678)
2016-10-30 - 02:33:06,582 ERROR ReplicaFetcherThread-5-1590174474 
ReplicaFetcherThread.error - [ReplicaFetcherThread-5-1590174474], Error due to 
kafka.common.KafkaException: - error processing data for partition [logs,41] 
offset 39021512301 Caused - by: java.lang.RuntimeException: Offset mismatch: 
fetched offset = 39021512301, log end offset = 39021512238.

First, ReplicaFetcherThread got a corrupted message (offset 39021512238) due to 
some blip.

Line 
https://github.com/apache/kafka/blob/0.9.0.1/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L138
 threw exception

Then, Line 
https://github.com/apache/kafka/blob/0.9.0.1/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L145
 caught it and logged this error.

Because 
https://github.com/apache/kafka/blob/0.9.0.1/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L134
 updated the topic partition offset to the fetched latest one in partitionMap. 
So ReplicaFetcherThread skipped the batch with corrupted messages. 

Based on 
https://github.com/apache/kafka/blob/0.9.0.1/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L84,
 the ReplicaFetcherThread then directly fetched the next batch of messages 
(with offset 39021512301)

Next, ReplicaFetcherThread stopped because the log end offset (still 
39021512238) didn't match the fetched message (offset 39021512301).

A quick fix is to move line 134 to be after line 138.

Would be great to have your comments and please let me know if a Jira issue is 
needed. Thanks.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (KAFKA-4384) ReplicaFetcherThread stopped after ReplicaFetcherThread received a corrupted message

Reply via email to