[ https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
hudeqi updated KAFKA-16710:
---------------------------
    Description: 
This issue occurs during a partition reassignment: 110879, 110880 (original leader, original follower) ---> 110879, 110880, 110881, 113915 (the latter two replicas are the new leader and new follower) ---> 110881, 113915 (new leader, new follower). The "Offset mismatch" exception occurs on the new follower 113915.

Analysis shows that the exception arises during the reassignment as follows:
# After the new replicas 110881 and 113915 have joined the ISR, the controller switches the leader from 110879 to 110881 and sends a new `leaderAndIsr` request (leader is 110881, ISR is 110879, 110880, 110881, 113915) to 110881 and 113915.
# On receiving it, 110881 executes `makeLeader` and 113915 executes `makeFollower`. After the new follower 113915 completes `removeFetcherForPartitions` and `addFetcherForPartitions`, it starts fetching data from the new leader 110881. Because the log end offset of the new leader 110881 (18735600055) is smaller than the log end offset of the new follower 113915 (18735600059), the new follower adds the partition to `divergingEndOffsets` during `processFetchRequest` and then calls `truncateOnFetchResponse` to truncate the local log to 18735600055.
# However, `truncateOnFetchResponse` must acquire `partitionMapLock`. At the same time, the new leader 110881 and the new follower 113915 receive another `leaderAndIsr` request from the controller (to remove the old replicas 110879 and 110880 from the ISR). The `ReplicaFetcherManager` thread of the new follower 113915 executes the second `makeFollower`, acquires `partitionMapLock` first, and executes `removeFetcherForPartitions`. It then reads the local log end offset (18735600059) to use as the fetch offset and prepares to execute `addFetcherForPartitions` again, which will write that fetch offset (18735600059) to `partitionStates`.
# Before that happens, the follower fetcher thread that was waiting to truncate the local log to 18735600055 obtains `partitionMapLock` and completes the truncation, so the log end offset is now 18735600055.
# The thread executing the second `makeFollower` then obtains `partitionMapLock` and executes `addFetcherForPartitions`, writing the now-stale fetch offset (18735600059) to `partitionStates`.
# As a result, the follower fetcher thread throws the following exception during `processPartitionData`: "java.lang.IllegalStateException: Offset mismatch for partition aiops-adplatform-interfacelog-191: fetched offset = 18735600059, log end offset = 18735600055."

The relevant logs are attached.

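
The interleaving in the steps above can be replayed deterministically in a small model. The following Python sketch is not Kafka code: `FollowerModel`, `replay_race`, and the method names are hypothetical stand-ins for the follower's log, `partitionStates`, and the calls named in the report. It runs single-threaded and simply performs the operations in the order the analysis describes, assuming that the fetch offset is read before the truncation and written after it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

PARTITION = "aiops-adplatform-interfacelog-191"


@dataclass
class FollowerModel:
    """Toy model of one follower replica (hypothetical, not Kafka's classes)."""
    log_end_offset: int
    # Stand-in for the fetcher's `partitionStates`: partition -> fetch offset
    partition_states: Dict[str, int] = field(default_factory=dict)

    def remove_fetcher(self, partition: str) -> None:
        # Models removeFetcherForPartitions
        self.partition_states.pop(partition, None)

    def add_fetcher(self, partition: str, fetch_offset: int) -> None:
        # Models addFetcherForPartitions: records the given fetch offset
        self.partition_states[partition] = fetch_offset

    def truncate_to(self, offset: int) -> None:
        # Models the truncation done by truncateOnFetchResponse
        self.log_end_offset = min(self.log_end_offset, offset)

    def process_partition_data(self, partition: str) -> None:
        # Models the consistency check that raised the exception in the report
        fetch_offset = self.partition_states[partition]
        if fetch_offset != self.log_end_offset:
            raise RuntimeError(
                f"Offset mismatch for partition {partition}: "
                f"fetched offset = {fetch_offset}, "
                f"log end offset = {self.log_end_offset}."
            )


def replay_race() -> Optional[str]:
    """Replay steps 2-6 in the reported order; return the error text, if any."""
    follower = FollowerModel(log_end_offset=18735600059)
    leader_end_offset = 18735600055

    # Step 2: first makeFollower registers the fetcher; a truncation to the
    # leader's end offset is now pending on the fetcher thread.
    follower.add_fetcher(PARTITION, follower.log_end_offset)

    # Step 3: second makeFollower removes the fetcher and reads the local
    # log end offset to use as the next fetch offset.
    follower.remove_fetcher(PARTITION)
    stale_fetch_offset = follower.log_end_offset

    # Step 4: the fetcher thread gets the lock first and truncates the log.
    follower.truncate_to(leader_end_offset)

    # Step 5: second makeFollower writes the offset it read before truncation.
    follower.add_fetcher(PARTITION, stale_fetch_offset)

    # Step 6: the next fetch response is checked against the truncated log.
    try:
        follower.process_partition_data(PARTITION)
    except RuntimeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(replay_race())
```

In this model the mismatch comes only from the ordering: the offset is captured in step 3 but stored in step 5, with the truncation in between. Reading the log end offset at the moment it is stored would avoid the stale value in the sketch; whether that matches a suitable fix in Kafka itself is not something this model can show.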
> Continuously `makeFollower` may cause the replica fetcher thread to encounter an offset mismatch exception when `processPartitionData`
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-16710
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16710
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, replication
>    Affects Versions: 2.8.1
>            Reporter: hudeqi
>            Assignee: hudeqi
>            Priority: Blocker

--
This message was sent by Atlassian Jira
(v8.20.10#820010)