ZuebeyirEser opened a new pull request, #2524: URL: https://github.com/apache/fluss/pull/2524
### Purpose

Linked issue: close #2405

### Brief change log

This PR fixes a race condition in which a rebalance task could be marked as `COMPLETED` before the old replicas were actually stopped and leaders were migrated. The race was especially visible under high load (e.g., 400 buckets).

The root cause was that `CoordinatorEventProcessor` used a "fire-and-forget" approach for `StopReplica` requests: it manually moved replicas to the Successful state immediately after sending the RPC.

Major changes:

* Updated `RebalanceManager` to track `pendingDeletions` for each bucket.
* Changed `CoordinatorEventProcessor` to wait for the `DeleteReplicaResponseReceivedEvent` callback before marking a task as finished.
* Added logic to handle `DeadTabletServerEvent` so that rebalance tasks don't hang if a server dies while waiting for a response.
* Made `RebalanceManager#buildClusterModel` more robust by skipping leaderless buckets (avoids crashes during table initialization).
* Separated the state transitions for rebalance replicas (immediate cleanup) from those for table-deletion replicas (cleanup via `TableManager`).

### Tests

* Added `RebalanceRaceConditionITCase` (regression test with 50 tables/400 buckets).
* Updated `RebalanceITCase` with stricter leader-count assertions.
* Verified that all `CoordinatorEventProcessorTest` cases pass.

### API and Format

no

### Documentation

no
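
For reviewers, the pending-deletion tracking described in the change log can be pictured roughly as below. This is a minimal, standalone sketch and not the code in this PR: the class `PendingDeletionTracker`, its method names, and the use of plain integer bucket and server ids are all hypothetical. It also assumes that a dead tablet server's outstanding deletions are simply treated as resolved, which the description does not spell out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the Fluss code base. */
class PendingDeletionTracker {
    // bucket id -> ids of tablet servers that have not yet acknowledged StopReplica
    private final Map<Integer, Set<Integer>> pendingDeletions = new HashMap<>();

    /** Record that a StopReplica request was sent to serverId for bucketId. */
    void stopReplicaSent(int bucketId, int serverId) {
        pendingDeletions.computeIfAbsent(bucketId, k -> new HashSet<>()).add(serverId);
    }

    /**
     * Handle a delete-replica response. Returns true only when this response
     * clears the last pending deletion for the bucket, i.e. the point at which
     * the rebalance task for that bucket may be marked as finished.
     */
    boolean responseReceived(int bucketId, int serverId) {
        Set<Integer> servers = pendingDeletions.get(bucketId);
        if (servers == null || !servers.remove(serverId)) {
            return false; // unknown, duplicate, or late response
        }
        if (servers.isEmpty()) {
            pendingDeletions.remove(bucketId);
            return true;
        }
        return false;
    }

    /**
     * Handle a dead server (assumption: its pending deletions count as resolved).
     * Returns the buckets that have no pending deletions left as a result.
     */
    List<Integer> serverDead(int serverId) {
        List<Integer> completed = new ArrayList<>();
        Iterator<Map.Entry<Integer, Set<Integer>>> it = pendingDeletions.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Set<Integer>> entry = it.next();
            if (entry.getValue().remove(serverId) && entry.getValue().isEmpty()) {
                it.remove();
                completed.add(entry.getKey());
            }
        }
        return completed;
    }

    /** True while at least one StopReplica acknowledgement is outstanding for the bucket. */
    boolean hasPending(int bucketId) {
        return pendingDeletions.containsKey(bucketId);
    }
}
```

In this sketch a bucket is reported as complete exactly once, either by the last response or by the death of the last server it was waiting on, so a task is neither finished early nor left hanging.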
