sryanyuan commented on PR #3375:
URL: https://github.com/apache/kvrocks/pull/3375#issuecomment-4053900497

   > > > @sryanyuan Would the compaction pressure triggered by a sudden burst 
of massive deletes be a potential risk point?
   > > 
   > > 
   > > You're correct that a sudden burst of massive deletes can increase 
compaction pressure in RocksDB.
   > > In our implementation, FLUSHSLOTS uses RocksDB's DeleteRange API, which 
is optimized to mark key ranges for deletion without immediately touching every 
SST file entry. This reduces write amplification compared to issuing many 
individual Delete operations.
   > > However, DeleteRange still leaves tombstones and invalidated data in the 
LSM tree, which will be cleaned up during subsequent compactions. This GC phase 
will rewrite SST files to reclaim space, and can cause extra compaction load.
   > > This command is intended for special operational scenarios — for 
example, when a cluster is not serving live traffic and an operator needs to 
bulk-clear data for specific slots. In such cases, temporary compaction 
pressure is acceptable, and we avoid running it during normal high-load periods.
   > > To mitigate impact, we recommend:
   > > 
   > > * Running FLUSHSLOTS during off-peak hours.
   > > * Limiting the number of slots cleared in a single operation (e.g., only 
a small fraction of the 16,384 slots at a time), to reduce the volume of 
deleted data and spread GC load over time.
   > > * Adjusting compaction options (e.g., background threads, delete range 
thresholds) if necessary.
   > > * Optionally forcing manual compaction on affected ranges after deletion.
   > 
   > Exactly. If DeleteRange removes half of the data at once, it's highly 
likely to cross RocksDB's internal monitoring thresholds and eventually 
trigger an involuntary background compaction. This could have a massive, 
unpredictable impact on the cluster's performance.
   > 
   > So I'm wondering if we should avoid calling DeleteRange immediately after 
slot migration and topology updates. Instead, would it be less impactful on the 
cluster to simply decommission and destroy the entire instance once all its 
data has been migrated out? This might be a cleaner way to offload the 
compaction overhead.
   > 
   > Since horizontal scaling is the goal, it would be much smoother to split 
the source node's data and migrate each half to two separate target nodes. Once 
the migration is finished and the original node is left with zero data, we can 
simply decommission and destroy the instance. This avoids the DeleteRange 
overhead entirely.
   
   Thanks for the suggestion — I understand your point about avoiding 
DeleteRange by decommissioning a node after migrating all its slots away. 
However, my use case is quite different, and this implementation does not 
affect the existing slot migration logic in Kvrocks.
   
   The main scenario for FLUSHSLOTS here is when we need to split a very large 
cluster into two clusters, and then consolidate data into a new cluster via 
replication, not slot migration. Specifically:
   
   * Create two new clusters, each with one master and one replica.
   * Change the topology so that the masters of the two new clusters replicate 
from the old cluster's slaves. For very large datasets, snapshot-based 
replication is much faster than slot-by-slot migration.
   * Once replication is complete, merge the two new clusters into a single 
cluster. At this point, each node will hold data for slots it no longer owns, 
so we use FLUSHSLOTS to clear those keys. This cleanup happens while the new 
cluster is not serving external traffic.
   * After the merge is complete, switch client traffic to the new cluster and 
decommission the old cluster.

   In this flow, FLUSHSLOTS is a necessary operation to remove extra slot data 
after the replication-based merge. The cluster is offline during the flush, so 
compaction pressure is not a concern for live traffic.
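   The cleanup step above can be sketched in a few lines. This is only an 
illustration, not this PR's implementation: `SlotsToFlush` is a hypothetical 
helper that finds the locally held slots falling outside the node's owned set 
and groups them into contiguous ranges, so FLUSHSLOTS can be issued in small 
batches as suggested earlier:

   ```cpp
   #include <cassert>
   #include <set>
   #include <utility>
   #include <vector>

   // Hypothetical helper (not part of this PR): after the replication-based
   // merge, find every slot the node still holds data for but no longer owns,
   // and group those slots into contiguous [start, end] ranges so FLUSHSLOTS
   // can be issued in small batches rather than one huge operation.
   std::vector<std::pair<int, int>> SlotsToFlush(const std::set<int> &local_slots,
                                                 const std::set<int> &owned_slots) {
     std::vector<std::pair<int, int>> ranges;
     for (int slot : local_slots) {            // std::set iterates in order
       if (owned_slots.count(slot)) continue;  // still owned: keep the data
       if (!ranges.empty() && ranges.back().second == slot - 1) {
         ranges.back().second = slot;          // extend the current range
       } else {
         ranges.push_back({slot, slot});       // start a new range
       }
     }
     return ranges;
   }

   int main() {
     // The node holds slots 0-5 after the merge but now owns only 2 and 3,
     // so slots 0-1 and 4-5 should be flushed, in two batches.
     std::set<int> local = {0, 1, 2, 3, 4, 5};
     std::set<int> owned = {2, 3};
     auto ranges = SlotsToFlush(local, owned);
     assert(ranges.size() == 2);
     assert(ranges[0] == std::make_pair(0, 1));
     assert(ranges[1] == std::make_pair(4, 5));
     return 0;
   }
   ```

   Flushing one small range at a time spreads the DeleteRange tombstones, and 
the follow-up compaction work, over time instead of creating them all at once.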
   
   Advantages of this approach:
   
   * No client-visible impact during data redistribution — unlike slot 
migration, which generates many MOVED/ASK redirections that clients must handle.
   * Much faster for large datasets, since snapshot-based replication avoids 
the per-key overhead of slot migration.
   * Simpler operationally when splitting or merging clusters at scale.
   
   (If replication breaks during the process, incremental sync can be handled 
by parsing the WAL, similar to the kvrocks2redis approach.)
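   The WAL-based catch-up mentioned in the parenthetical can be illustrated 
with a toy replay loop. The `WalRecord` layout and `ReplayWal` helper below 
are assumptions made for illustration; a real kvrocks2redis-style consumer 
tails RocksDB's WAL (e.g. via `DB::GetUpdatesSince`) rather than an in-memory 
vector, but the resume-from-last-applied-sequence idea is the same:

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <map>
   #include <string>
   #include <vector>

   // Toy WAL record for illustration only: a monotonically increasing sequence
   // number plus one key/value write. Real RocksDB WAL entries are write
   // batches obtained through DB::GetUpdatesSince().
   struct WalRecord {
     uint64_t seq;
     std::string key;
     std::string value;
   };

   // Apply every record newer than last_applied_seq to the target and return
   // the new high-water mark. Because records at or below the mark are
   // skipped, replaying again after a broken link resumes where it left off.
   uint64_t ReplayWal(const std::vector<WalRecord> &wal, uint64_t last_applied_seq,
                      std::map<std::string, std::string> *target) {
     for (const auto &rec : wal) {
       if (rec.seq <= last_applied_seq) continue;  // already applied
       (*target)[rec.key] = rec.value;
       last_applied_seq = rec.seq;
     }
     return last_applied_seq;
   }

   int main() {
     std::vector<WalRecord> wal = {{1, "a", "1"}, {2, "b", "2"}, {3, "a", "3"}};
     std::map<std::string, std::string> target = {{"a", "1"}};  // applied up to seq 1

     uint64_t seq = ReplayWal(wal, 1, &target);
     assert(seq == 3);
     assert(target["a"] == "3" && target["b"] == "2");

     // Replaying the same WAL from the returned mark is a no-op.
     assert(ReplayWal(wal, seq, &target) == 3);
     return 0;
   }
   ```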

