[ https://issues.apache.org/jira/browse/KAFKA-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337705#comment-17337705 ]
Ismael Juma edited comment on KAFKA-7635 at 7/14/21, 1:14 PM: -------------------------------------------------------------- This bug has been fixed by KIP-461: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-461+-+Improve+Replica+Fetcher+behavior+at+handling+partition+failure] github commit: [https://github.com/apache/kafka/commit/414852c701763b6f8362b44d156753b6c3ef247a# |https://github.com/confluentinc/ce-kafka/commit/414852c701763b6f8362b44d156753b6c3ef247a#] Earliest available release: [https://github.com/apache/kafka/releases/tag/2.3.0|https://github.com/confluentinc/ce-kafka/releases/tag/2.3.1] was (Author: yding): This bug has been fixed by KIP-461: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-461+-+Improve+Replica+Fetcher+behavior+at+handling+partition+failure] github commit:[ https://github.com/apache/kafka/commit/414852c701763b6f8362b44d156753b6c3ef247|https://github.com/confluentinc/ce-kafka/commit/414852c701763b6f8362b44d156753b6c3ef247a#] Earliest available release: [https://github.com/apache/kafka/releases/tag/2.3.0|https://github.com/confluentinc/ce-kafka/releases/tag/2.3.1] > FetcherThread stops processing after "Error processing data for partition" > -------------------------------------------------------------------------- > > Key: KAFKA-7635 > URL: https://issues.apache.org/jira/browse/KAFKA-7635 > Project: Kafka > Issue Type: Bug > Components: replication > Affects Versions: 2.0.0 > Reporter: Steven Aerts > Priority: Major > Attachments: stacktraces.txt > > > After disabling unclean leader leader again after recovery of a situation > where we enabled unclean leader due to a split brain in zookeeper, we saw > that some of our brokers stopped replicating their partitions. > Digging into the logs, we saw that the replica thread was stopped because one > partition had a failure which threw a [{{Error processing data for > partition}} > exception|https://github.com/apache/kafka/blob/2.0.0/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L207]. > But the broker kept running and serving the partitions from which it was > leader. > We saw three different types of exceptions triggering this (example > stacktraces attached): > * {{kafka.common.UnexpectedAppendOffsetException}} > * {{Trying to roll a new log segment for topic partition partition-b-97 with > start offset 1388 while it already exists.}} > * {{Kafka scheduler is not running.}} > We think there are two acceptable ways for the kafka broker to handle this: > * Mark those partitions as a partition with error and handle them > accordingly. As is done [when a {{CorruptRecordException}} or > {{KafkaStorageException}}|https://github.com/apache/kafka/blob/2.0.0/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L196] > is thrown. > * Exit the broker as is done [when log truncation is not > allowed|https://github.com/apache/kafka/blob/2.0.0/core/src/main/scala/kafka/server/ReplicaFetcherThread.scala#L189]. > > Maybe even a combination of both. Our probably naive idea is that for the > first two types the first strategy would be the best, but for the last type, > it is probably better to re-throw a {{FatalExitError}} and exit the broker. -- This message was sent by Atlassian Jira (v8.3.4#803005)