swuferhong commented on issue #2526:
URL: https://github.com/apache/fluss/issues/2526#issuecomment-3838205114

   > Hi [@luoyuxia](https://github.com/luoyuxia) , I have successfully 
reproduced this issue and identified the root cause.
   > 
   > ### Root Cause
   > The issue is caused by the `ReplicaManager` incorrectly deleting remote 
snapshots during a local replica stop operation (which occurs during 
rebalancing). I traced the log `Delete table's remote bucket snapshot dir... 
success` to `ReplicaManager#stopReplica`. The code logic explicitly deletes the 
remote snapshot if `delete=true` and the node is the leader:
   > 
   > // Original code in ReplicaManager.java
   > remoteLogManager.stopReplica(replicaToDelete, delete && 
replicaToDelete.isLeader());
   > if (delete && replicaToDelete.isLeader()) {
   >     // This incorrectly deletes shared remote files during local cleanup
   >     kvManager.deleteRemoteKvSnapshot(
   >             replicaToDelete.getPhysicalTablePath(), 
replicaToDelete.getTableBucket());
   > }
   > When a rebalance occurs, the primary key table leader is stopped on the 
old node with `delete=true` to clean up local resources. However, the code 
above was also deleting the shared remote snapshot, causing the new leader (on 
a different node) to fail with `KvSnapshotNotExistException` when trying to 
load that data.
   > 
   > ### Fix
   > I have verified that removing the kvManager.deleteRemoteKvSnapshot(...) 
call fixes the issue. Remote snapshots should only be deleted when the table is 
explicitly dropped, not during replica migration.
   > 
   > I have written a regression test KvSnapshotDeletionBugReplicationTest that 
verifies that dropKv (used during rebalance) now correctly cleans up local 
files while preserving the remote snapshot.
   > 
   > Could you please assign this issue to me, I will submit a PR with the fix 
shortly.
   > 
   > Edit: forget to tag you
   
   Good catch! However, the fix shouldn't simply remove the deletion logic. 
Instead, we need to handle the rebalance scenario properly within the `RPC 
StopReplicaRequest`—for example, by introducing a boolean flag like 
`deleteRemoteKvSnapshot`. Let's discuss this fix first under this issue. What 
do you think? @wuchong @luoyuxia 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to