swuferhong commented on issue #2526: URL: https://github.com/apache/fluss/issues/2526#issuecomment-3838205114
> Hi [@luoyuxia](https://github.com/luoyuxia) , I have successfully reproduced this issue and identified the root cause. > > ### Root Cause > The issue is caused by the `ReplicaManager` incorrectly deleting remote snapshots during a local replica stop operation (which occurs during rebalancing). I traced the log `Delete table's remote bucket snapshot dir... success` to `ReplicaManager#stopReplica`. The code logic explicitly deletes the remote snapshot if `delete=true` and the node is the leader: > > // Original code in ReplicaManager.java > remoteLogManager.stopReplica(replicaToDelete, delete && replicaToDelete.isLeader()); > if (delete && replicaToDelete.isLeader()) { > // This incorrectly deletes shared remote files during local cleanup > kvManager.deleteRemoteKvSnapshot( > replicaToDelete.getPhysicalTablePath(), replicaToDelete.getTableBucket()); > } > When a rebalance occurs, the primary key table leader is stopped on the old node with `delete=true` to clean up local resources. However, the code above was also deleting the shared remote snapshot, causing the new leader (on a different node) to fail with `KvSnapshotNotExistException` when trying to load that data. > > ### Fix > I have verified that removing the kvManager.deleteRemoteKvSnapshot(...) call fixes the issue. Remote snapshots should only be deleted when the table is explicitly dropped, not during replica migration. > > I have written a regression test KvSnapshotDeletionBugReplicationTest that verifies that dropKv (used during rebalance) now correctly cleans up local files while preserving the remote snapshot. > > Could you please assign this issue to me, I will submit a PR with the fix shortly. > > Edit: forget to tag you Good catch! However, the fix shouldn't simply remove the deletion logic. Instead, we need to handle the rebalance scenario properly within the `RPC StopReplicaRequest`—for example, by introducing a boolean flag like `deleteRemoteKvSnapshot`. Let's discuss this fix first under this issue. What do you think? @wuchong @luoyuxia -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
