[ https://issues.apache.org/jira/browse/KAFKA-12241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364493#comment-17364493 ]
Jack Foy commented on KAFKA-12241:
----------------------------------

In my opinion this is a cleaner fix than the one proposed in https://issues.apache.org/jira/browse/KAFKA-3861 for the same problem.

> Partition offline when ISR shrinks to leader and LogDir goes offline
> --------------------------------------------------------------------
>
>                 Key: KAFKA-12241
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12241
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.2
>            Reporter: Noa Resare
>            Priority: Major
>
> This is a long-standing issue that we haven't previously tracked in a JIRA.
> We experience this maybe once per month on average, and we see the following
> sequence of events:
> # A broker shrinks ISR to just itself for a partition. However, the
> followers are at highWatermark: {{[Partition PARTITION broker=601] Shrinking
> ISR from 1501,601,1201,1801 to 601. Leader: (highWatermark: 432385279,
> endOffset: 432385280). Out of sync replicas: (brokerId: 1501, endOffset:
> 432385279) (brokerId: 1201, endOffset: 432385279) (brokerId: 1801, endOffset:
> 432385279).}}
> # Around this time (in the case I have in front of me, 20ms earlier
> according to the logging subsystem) LogDirFailureChannel captures an Error
> while appending records to PARTITION due to a readonly filesystem.
> # ~20ms after the ISR shrink, LogDirFailureHandler offlines the partition:
> Logs for partitions LIST_OF_PARTITIONS are offline and logs for future
> partitions are offline due to failure on log directory /kafka/d6/data
> # ~50ms later the controller marks the replicas as offline from 601:
> message: [Controller id=901] Mark replicas LIST_OF_PARTITIONS on broker 601
> as offline
> # ~2ms later the controller offlines the partition: [Controller id=901
> epoch=4] Changed partition PARTITION state from OnlinePartition to
> OfflinePartition
> To resolve this, someone needs to manually enable unclean leader election,
> which is obviously not ideal. Since the leader knows that all the followers
> that are removed from ISR are at highWatermark, maybe it could convey that to
> the controller in the LeaderAndIsr response so that the controller could pick
> a new leader without having to resort to unclean leader election.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
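
The proposal in the quoted description can be illustrated with a small model. This is only a sketch of the idea, not Kafka code: `IsrShrink`, `safe_candidates`, and `pick_leader` are hypothetical names, and the offsets are the ones from the log line in the description. It assumes the high watermark did not advance after the shrink, which holds in the reported case because the log directory had already failed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class IsrShrink:
    """Hypothetical summary of one ISR shrink, as known to the leader."""
    leader_id: int
    high_watermark: int
    # brokerId -> endOffset of each follower removed from the ISR
    removed_end_offsets: Dict[int, int]


def safe_candidates(shrink: IsrShrink) -> List[int]:
    """Removed followers whose log already reaches the high watermark."""
    return sorted(
        broker
        for broker, end_offset in shrink.removed_end_offsets.items()
        if end_offset >= shrink.high_watermark
    )


def pick_leader(shrink: IsrShrink, live_brokers: Set[int]) -> Optional[int]:
    """First live candidate that holds every committed record, or None."""
    for broker in safe_candidates(shrink):
        if broker in live_brokers:
            return broker
    return None  # no safe choice; only unclean election remains


# Values taken from the "Shrinking ISR" log line in the description.
shrink = IsrShrink(
    leader_id=601,
    high_watermark=432385279,
    removed_end_offsets={1501: 432385279, 1201: 432385279, 1801: 432385279},
)
print(safe_candidates(shrink))
print(pick_leader(shrink, live_brokers={1501, 1801}))
```

In this model every removed follower is at the high watermark, so any of them could take over once broker 601 loses its log directory. A follower whose end offset is below the high watermark would be excluded.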