sryanyuan commented on PR #3375: URL: https://github.com/apache/kvrocks/pull/3375#issuecomment-4053900497
> > > @sryanyuan Would the compaction pressure triggered by a sudden burst of massive deletes be a potential risk point?
> >
> > You're correct that a sudden burst of massive deletes can increase compaction pressure in RocksDB.
> >
> > In our implementation, FLUSHSLOTS uses RocksDB's DeleteRange API, which is optimized to mark key ranges for deletion without immediately touching every SST file entry. This reduces write amplification compared to issuing many individual Delete operations.
> >
> > However, DeleteRange still leaves tombstones and invalidated data in the LSM tree, which will be cleaned up during subsequent compactions. This GC phase will rewrite SST files to reclaim space, and can cause extra compaction load.
> >
> > This command is intended for special operational scenarios, for example when a cluster is not serving live traffic and an operator needs to bulk-clear data for specific slots. In such cases, temporary compaction pressure is acceptable, and we avoid running it during normal high-load periods.
> >
> > To mitigate impact, we recommend:
> >
> > * Running FLUSHSLOTS during off-peak hours.
> > * Limiting the number of slots cleared in a single operation (e.g., only a small fraction of the 16,384 slots at a time), to reduce the volume of deleted data and spread GC load over time.
> > * Adjusting compaction options (e.g., background threads, delete range thresholds) if necessary.
> > * Optionally forcing manual compaction on affected ranges after deletion.
>
> Exactly. If DeleteRange removes half of the data at once, it's highly likely to cross RocksDB's internal monitoring thresholds, triggering an involuntary background compaction event finally. This could result in a massive, unpredictable impact on the cluster's performance.
>
> So I'm wondering if we should avoid calling DeleteRange immediately after slot migration and topology updates.
> Instead, would it be less impactful on the cluster to simply decommission and destroy the entire instance once all its data has been migrated out? This might be a cleaner way to offload the compaction overhead.
>
> Since horizontal scaling is the goal, it would be much smoother to split the source node's data and migrate each half to two separate target nodes. Once the migration is finished and the original node is left with zero data, we can simply decommission and destroy the instance. This avoids the DeleteRange overhead entirely.

Thanks for the suggestion. I understand your point about avoiding DeleteRange by decommissioning a node after migrating all its slots away. However, my use case is quite different, and this implementation does not affect the existing slot migration logic in Kvrocks.

The main scenario for FLUSHSLOTS here is when we need to split a very large cluster into two clusters, and then consolidate data into a new cluster via replication, not slot migration. Specifically:

* Create two new clusters, each with one master and one replica.
* Change the topology so that the masters of the two new clusters replicate from the old cluster's slaves. For very large datasets, snapshot-based replication is much faster than slot-by-slot migration.
* Once replication is complete, merge the two new clusters into a single cluster. At this point, each node will hold data for slots it no longer owns, so we use FLUSHSLOTS to clear those keys. This cleanup happens while the new cluster is not serving external traffic.
* After the merge is complete, switch client traffic to the new cluster and decommission the old cluster.

In this flow, FLUSHSLOTS is a necessary operation to remove extra slot data after the replication-based merge. The cluster is offline during the flush, so compaction pressure is not a concern for live traffic.
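
To make the cleanup step in this flow concrete, here is a minimal Python sketch of how an operator script might work out which slots to clear on one node and split them into small batches. It only plans the work: the helper names (`stale_slots`, `to_ranges`, `batches`) and the batch size are hypothetical, and the actual FLUSHSLOTS argument syntax is whatever this PR defines, so it is deliberately not shown.

```python
from typing import Iterable, Iterator, List, Tuple

SlotRange = Tuple[int, int]  # inclusive (start, end) slot numbers


def stale_slots(held: Iterable[int], owned: Iterable[int]) -> List[int]:
    """Slots a node still holds data for but no longer owns after the merge."""
    owned_set = set(owned)
    return sorted(s for s in set(held) if s not in owned_set)


def to_ranges(slots: Iterable[int]) -> List[SlotRange]:
    """Collapse slot numbers into contiguous inclusive ranges."""
    ranges: List[SlotRange] = []
    for s in sorted(slots):
        if ranges and s == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], s)  # extend the current range
        else:
            ranges.append((s, s))  # start a new range
    return ranges


def batches(ranges: Iterable[SlotRange], max_slots: int) -> Iterator[SlotRange]:
    """Split ranges so that no single batch covers more than max_slots slots."""
    if max_slots < 1:
        raise ValueError("max_slots must be at least 1")
    for start, end in ranges:
        while start <= end:
            stop = min(end, start + max_slots - 1)
            yield (start, stop)
            start = stop + 1


if __name__ == "__main__":
    # Hypothetical node: replicated slots 0-8191, owns only 0-4095 after the merge.
    plan = list(batches(to_ranges(stale_slots(range(0, 8192), range(0, 4096))), 1024))
    for start, end in plan:
        print(f"clear slots {start}-{end}")
```

Each resulting range could then be cleared in its own FLUSHSLOTS call, optionally with a pause or a manual compaction between batches, which is one way to apply the "limit the number of slots per operation" mitigation mentioned earlier in the thread.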

Advantages of this approach:

* No client-visible impact during data redistribution, unlike slot migration, which generates many MOVED/ASK redirections that clients must handle.
* Much faster for large datasets, since snapshot-based replication avoids the per-key overhead of slot migration.
* Simpler operationally when splitting or merging clusters at scale.

(If replication breaks during the process, incremental sync can be handled by parsing the WAL, similar to the kvrocks2redis approach.)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
