[ 
https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-16710:
---------------------------
    Affects Version/s: 3.8.0

> Continuously `makeFollower` may cause the replica fetcher thread to encounter 
> an offset mismatch exception when `processPartitionData`
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-16710
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16710
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, replication
>    Affects Versions: 2.8.1, 3.8.0
>            Reporter: hudeqi
>            Assignee: hudeqi
>            Priority: Blocker
>         Attachments: 企业微信截图_230257fe-1c11-4e77-93b3-b8b8edce2ba3.png, 
> 企业微信截图_a5d3e50f-6982-43f7-9263-5e3c5b49cc1e.png, 
> 企业微信截图_e47e04cf-dc5d-49e6-b32d-ba2934c8a50a.png
>
>
> This issue occurs while a partition is being reassigned: 110879, 110880 
> (original leader, original follower) ---> 110879, 110880, 110881, 113915 
> (the latter two replicas are the new leader and new follower) ---> 110881, 
> 113915 (new leader, new follower). The "Offset mismatch" exception occurs 
> on the new follower 113915.
> Analysis shows that the exception arises from the following sequence 
> during the reassignment:
>  # After the new replicas 110881 and 113915 have both joined the ISR, the 
> controller switches the leader from 110879 to 110881 and then sends a new 
> `leaderAndIsr` request (leader is 110881, ISR is 110879, 110880, 110881, 
> 113915) to 110881 and 113915.
>  # On receiving it, 110881 executes `makeLeader` and 113915 executes 
> `makeFollower`. After the new follower 113915 completes 
> `removeFetcherForPartitions` and `addFetcherForPartitions`, it starts 
> fetching from the new leader 110881. Because the log end offset of the new 
> leader 110881 (18735600055) is smaller than that of the new follower 113915 
> (18735600059), the new follower adds the partition to `divergingEndOffsets` 
> during `processFetchRequest` and then calls `truncateOnFetchResponse` to 
> truncate its local log to 18735600055.
>  # However, `truncateOnFetchResponse` must acquire `partitionMapLock`. At 
> the same time, the new leader 110881 and the new follower 113915 receive 
> another `leaderAndIsr` request from the controller (to remove the old 
> replicas 110879, 110880 from the ISR). The `ReplicaFetcherManager` thread of 
> the new follower 113915 executes the second `makeFollower`, acquires 
> `partitionMapLock` first, and executes `removeFetcherForPartitions`. It then 
> reads the local log end offset (18735600059) to use as the fetch offset and 
> prepares to call `addFetcherForPartitions` again, which will write that 
> fetch offset (18735600059) into `partitionStates`.
>  # Before `addFetcherForPartitions` runs, the follower fetcher thread that 
> was waiting to truncate the local log to 18735600055 acquires 
> `partitionMapLock` and completes the truncation, so the log end offset is 
> now 18735600055.
>  # The thread executing the second `makeFollower` then acquires 
> `partitionMapLock` and executes `addFetcherForPartitions`, writing the now 
> stale fetch offset (18735600059) into `partitionStates`.
>  # As a result, the follower fetcher thread throws the following exception 
> during `processPartitionData`: "java.lang.IllegalStateException: Offset 
> mismatch for partition aiops-adplatform-interfacelog-191: fetched offset = 
> 18735600059, log end offset = 18735600055."
>  
> The relevant logs are attached.
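
The interleaving described in steps 3 to 6 can be replayed deterministically with a small model. This is only an illustrative sketch, not Kafka code: the `Follower` class, the `replay` function, and the `reread_under_lock` flag are hypothetical names, the lock is not modelled (the steps simply run in the order the report gives), and the "re-read" variant is one conceivable mitigation assumed for comparison, not the fix adopted for this issue.

```python
from dataclasses import dataclass
from typing import Optional

LEADER_LEO = 18735600055    # log end offset of new leader 110881 (from the report)
FOLLOWER_LEO = 18735600059  # log end offset of new follower 113915 (from the report)


class OffsetMismatch(Exception):
    """Stand-in for the IllegalStateException quoted in the report."""


@dataclass
class Follower:
    log_end_offset: int
    fetch_offset: Optional[int]  # simplified partitionStates entry; None = no fetcher


def process_partition_data(f: Follower) -> None:
    # Simplified version of the check that fails in step 6.
    if f.fetch_offset != f.log_end_offset:
        raise OffsetMismatch(
            f"fetched offset = {f.fetch_offset}, log end offset = {f.log_end_offset}")


def replay(reread_under_lock: bool) -> Follower:
    f = Follower(FOLLOWER_LEO, FOLLOWER_LEO)
    # Step 3: second makeFollower removes the fetcher, then captures the
    # local log end offset to use as the fetch offset.
    f.fetch_offset = None
    captured = f.log_end_offset
    # Step 4: the fetcher thread wins the lock and truncates to the leader's offset.
    f.log_end_offset = min(f.log_end_offset, LEADER_LEO)
    # Step 5: makeFollower re-adds the fetcher. With reread_under_lock=False it
    # writes the captured (now stale) offset; with True (hypothetical mitigation)
    # it reads the log end offset again at this point.
    f.fetch_offset = f.log_end_offset if reread_under_lock else captured
    return f


if __name__ == "__main__":
    for flag in (False, True):
        f = replay(flag)
        try:
            process_partition_data(f)
            print(f"reread_under_lock={flag}: offsets agree at {f.log_end_offset}")
        except OffsetMismatch as e:
            print(f"reread_under_lock={flag}: Offset mismatch: {e}")
```

With `reread_under_lock=False` the model ends with the same pair of offsets as the reported exception. The point of the sketch is that the log end offset is read in step 3 but consumed in step 5, with a truncation in between.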



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
