[ https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
hudeqi updated KAFKA-16710:
---------------------------
    Affects Version/s: 3.8.0

> Continuously `makeFollower` may cause the replica fetcher thread to encounter
> an offset mismatch exception when `processPartitionData`
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-16710
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16710
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, replication
>    Affects Versions: 2.8.1, 3.8.0
>            Reporter: hudeqi
>            Assignee: hudeqi
>            Priority: Blocker
>         Attachments: 企业微信截图_230257fe-1c11-4e77-93b3-b8b8edce2ba3.png,
> 企业微信截图_a5d3e50f-6982-43f7-9263-5e3c5b49cc1e.png,
> 企业微信截图_e47e04cf-dc5d-49e6-b32d-ba2934c8a50a.png
>
>
> This issue occurs during the reassignment of a partition:
> 110879, 110880 (original leader, original follower) ---> 110879, 110880,
> 110881, 113915 (the latter two replicas are the new leader and new follower)
> ---> 110881, 113915 (new leader, new follower). The "Offset mismatch"
> exception occurs on the new follower 113915.
> Analysis shows that the exception arises during the reassignment as follows:
> # After the new replicas 110881 and 113915 have fully joined the ISR, the
> controller switches the leader from 110879 to 110881 and then sends a new
> `leaderAndIsr` (leader is 110881, ISR is 110879, 110880, 110881, 113915) to
> 110881 and 113915.
> # 110881 then executes `makeLeader`, and 113915 executes `makeFollower`.
> After the new follower 113915 completes `removeFetcherForPartitions` and
> `addFetcherForPartitions`, it starts fetching data from the new leader
> 110881. Because the log end offset of the new leader 110881 (18735600055) is
> smaller than the log end offset of the new follower 113915 (18735600059),
> the new follower adds the partition to `divergingEndOffsets` during
> `processFetchRequest` and then calls `truncateOnFetchResponse` to truncate
> the local log to 18735600055.
> # However, `truncateOnFetchResponse` needs to acquire `partitionMapLock`.
> At the same time, the new leader 110881 and the new follower 113915 receive
> another `leaderAndIsr` request from the controller (to remove the old
> replicas 110879, 110880 from the ISR). The `ReplicaFetcherManager` thread of
> the new follower 113915 executes the second `makeFollower`, acquires
> `partitionMapLock` first, and executes `removeFetcherForPartitions`. It then
> reads the local log end offset (18735600059) to use as the fetch offset,
> ready to execute `addFetcherForPartitions` again and write that fetch offset
> into `partitionStates`.
> # Before that happens, the follower fetcher thread that was waiting to
> truncate the local log to 18735600055 obtains `partitionMapLock` first and
> completes the truncation, so the log end offset is now 18735600055.
> # Then the thread executing the second `makeFollower` obtains
> `partitionMapLock` and executes `addFetcherForPartitions`, writing the
> now-stale fetch offset (18735600059) into `partitionStates`.
> # As a result, the follower fetcher thread throws the following exception
> during `processPartitionData`: "java.lang.IllegalStateException: Offset
> mismatch for partition aiops-adplatform-interfacelog-191: fetched offset =
> 18735600059, log end offset = 18735600055."
>
> The relevant logs are attached.
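
The interleaving described in the steps above can be replayed with a small, single-threaded toy model. This is only a sketch of the ordering, not Kafka code: `FollowerModel`, `replay_race`, and the snake_case method names are hypothetical stand-ins for the Kafka methods named in the report, `RuntimeError` stands in for `java.lang.IllegalStateException`, and the locking is reduced to the order in which the two threads take their turns.

```python
from dataclasses import dataclass

PARTITION = "aiops-adplatform-interfacelog-191"


@dataclass
class FollowerModel:
    """Toy model of the new follower's state (hypothetical, not Kafka code)."""
    log_end_offset: int  # end of the follower's local log
    fetch_offset: int    # offset recorded in the modelled partitionStates

    def truncate_to(self, offset: int) -> None:
        # Models truncateOnFetchResponse: the local log shrinks to `offset`.
        self.log_end_offset = min(self.log_end_offset, offset)

    def add_fetcher(self, initial_offset: int) -> None:
        # Models addFetcherForPartitions: record the initial fetch offset.
        self.fetch_offset = initial_offset

    def process_partition_data(self) -> None:
        # Models the consistency check that fails in the report.
        if self.fetch_offset != self.log_end_offset:
            raise RuntimeError(
                f"Offset mismatch for partition {PARTITION}: "
                f"fetched offset = {self.fetch_offset}, "
                f"log end offset = {self.log_end_offset}"
            )


def replay_race() -> str:
    """Replay steps 3 to 6 in the order given in the report."""
    leader_log_end_offset = 18735600055
    follower = FollowerModel(log_end_offset=18735600059,
                             fetch_offset=18735600059)

    # Step 3: the second makeFollower removes the fetcher and snapshots
    # the local log end offset before the truncation has happened.
    snapshot = follower.log_end_offset

    # Step 4: the fetcher thread gets the lock first and truncates.
    follower.truncate_to(leader_log_end_offset)

    # Step 5: the second makeFollower re-adds the fetcher with the snapshot,
    # which is stale by now.
    follower.add_fetcher(snapshot)

    # Step 6: the next fetch response fails the offset check.
    try:
        follower.process_partition_data()
    except RuntimeError as err:
        return str(err)
    return "no mismatch"


if __name__ == "__main__":
    print(replay_race())
```

Run as a script, the model prints the same "Offset mismatch" message as the report, with fetched offset 18735600059 and log end offset 18735600055. Because it is sequential, it demonstrates only that this ordering of the snapshot, the truncation, and the re-add produces the mismatch; it says nothing about how often the ordering occurs on a real broker.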