ZuebeyirEser opened a new pull request, #2524:
URL: https://github.com/apache/fluss/pull/2524

   ### Purpose
   Linked issue: close #2405
   
   ### Brief change log
   This PR fixes a race condition in which a rebalance task could be marked
   `COMPLETED` before the old replicas had actually stopped and the leaders had
   migrated. The race was especially visible under high load (e.g., 400 buckets).
   
   The root cause was that `CoordinatorEventProcessor` used a "fire-and-forget"
   approach for `StopReplica` requests: it manually moved replicas to the
   Successful state immediately after sending the RPC, without waiting for
   the response.
   
   Major changes:
   * Updated `RebalanceManager` to track `pendingDeletions` for each bucket.
   * Changed `CoordinatorEventProcessor` to wait for the 
`DeleteReplicaResponseReceivedEvent` callback before marking a task as finished.
   * Added logic to handle `DeadTabletServerEvent` so rebalance tasks don't 
hang if a server dies while waiting for a response.
   * Made `RebalanceManager#buildClusterModel` more robust by skipping
   leaderless buckets (avoids crashes during table initialization).
   * Separated the state transitions for rebalance replicas (immediate
   cleanup) from those for table-deletion replicas (cleanup via `TableManager`).
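
The tracking idea in the list above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class `PendingDeletionTracker`, its method names, and the use of plain `int` ids for buckets and servers are all hypothetical. It also assumes that a dead server's outstanding deletions are simply dropped so the task can finish; the description does not say whether the PR counts them as done or as failed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch (not the Fluss implementation): unacknowledged replica deletions per bucket. */
class PendingDeletionTracker {
    // bucket id -> ids of servers whose StopReplica response has not arrived yet
    private final Map<Integer, Set<Integer>> pendingDeletions = new HashMap<>();

    /** Record that a StopReplica request was sent to each given server for this bucket. */
    void register(int bucketId, Set<Integer> serverIds) {
        if (!serverIds.isEmpty()) {
            pendingDeletions.computeIfAbsent(bucketId, k -> new HashSet<>()).addAll(serverIds);
        }
    }

    /**
     * Handle a deletion response from one server.
     * Returns true only if this response cleared the last pending deletion of the bucket,
     * which is the point at which the rebalance task may be marked finished.
     */
    boolean onDeleteResponse(int bucketId, int serverId) {
        Set<Integer> pending = pendingDeletions.get(bucketId);
        if (pending == null || !pending.remove(serverId)) {
            return false; // unknown bucket or duplicate response: nothing changes
        }
        if (pending.isEmpty()) {
            pendingDeletions.remove(bucketId);
            return true;
        }
        return false;
    }

    /**
     * Handle a dead server: its responses will never arrive, so stop waiting for them.
     * Returns the buckets that have no pending deletions left as a result.
     */
    Set<Integer> onServerDead(int serverId) {
        Set<Integer> completed = new HashSet<>();
        Iterator<Map.Entry<Integer, Set<Integer>>> it = pendingDeletions.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Set<Integer>> entry = it.next();
            if (entry.getValue().remove(serverId) && entry.getValue().isEmpty()) {
                completed.add(entry.getKey());
                it.remove();
            }
        }
        return completed;
    }

    /** True while at least one deletion for the bucket is still unacknowledged. */
    boolean hasPending(int bucketId) {
        return pendingDeletions.containsKey(bucketId);
    }
}
```

In this sketch a task would be marked finished only when `onDeleteResponse` returns true or when its bucket appears in the set returned by `onServerDead`, never at the moment the request is sent.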
   
   ### Tests
   * Added `RebalanceRaceConditionITCase` (regression test with 50 tables/400 
buckets).
   * Updated `RebalanceITCase` with stricter leader-count assertions.
   * Verified all `CoordinatorEventProcessorTest` cases pass.
   
   ### API and Format
   no
   
   ### Documentation
   no

