In Cassandra, "repair" refers to anti-entropy repairs. I think that's where most of the confusion is. DBAs see the word "repair" and think it is a one-off operation to "fix something broken". Users then incorrectly assume that once it is fixed, there shouldn't be a need to repair again.
However, in a distributed environment, the reality is that replicas can get out of sync for whatever reason -- nodes going offline, nodes temporarily unresponsive, nodes suffering from a hardware failure, etc. Entropy ensues. It is necessary to keep the data consistent across the cluster, so we run anti-entropy repairs.

The recommendation is that you run repairs at least once every gc_grace_seconds (GCGS). GCGS by default is 10 days, so a good rule of thumb is to run repairs once a week.

Let me address some of the points you raised.

> ... we run into things like "running repairs", "running compactions",
> understand tombstones (row and range), TTLs, etc etc becomes critical as
> data is growing.

Compactions are part of the normal operation of Cassandra. You shouldn't, however, be manually running compactions. If you are, something is wrong and it's most likely a band-aid solution to an underlying problem you need to address.

> But on the other hand we also see often lots of warnings... Like "if you
> start Cassandra Reaper you can not stop doing that" ...

As above, you need to run repairs regularly. It isn't a one-off operation. Reaper is a good tool for managing repairs in an automated fashion.

Here are some useful resources on repairs in Cassandra:

- Repair document @ the Apache website - https://cassandra.apache.org/doc/latest/operating/repair.html
- DataStax Academy video on Repair - https://www.youtube.com/watch?v=5V5rGDTHs20
- YouTube playlist on DataStax Academy Cassandra Operations course - https://www.youtube.com/playlist?list=PL2g2h-wyI4SrHMlHBJVe_or_Ryek2THgQ
- DataStax Doc on when to run repairs - https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsRepairNodesWhen.html

Cheers!
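
P.S. To make the once-a-week rule of thumb concrete, here is a minimal Python sketch of the arithmetic. It only compares a repair interval against gc_grace_seconds; the function names are made up for illustration and are not part of Cassandra or Reaper. It assumes the default GCGS of 864000 seconds (10 days) unless you pass the value actually set on your tables.

```python
from datetime import timedelta

# Cassandra's default gc_grace_seconds: 864000 s = 10 days.
DEFAULT_GC_GRACE_SECONDS = 864_000


def repair_interval_is_safe(interval: timedelta,
                            gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """Hypothetical helper: True if repairs run at least once per GCGS window."""
    return interval.total_seconds() <= gc_grace_seconds


def headroom(interval: timedelta,
             gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> timedelta:
    """Hypothetical helper: slack left between the repair interval and GCGS."""
    return timedelta(seconds=gc_grace_seconds) - interval


if __name__ == "__main__":
    weekly = timedelta(days=7)
    print(repair_interval_is_safe(weekly))  # weekly fits inside 10 days
    print(headroom(weekly))                 # slack for slow or failed runs
```

With the default GCGS, a weekly schedule leaves about three days of slack for a repair that runs long or has to be retried. If you have lowered gc_grace_seconds on a table, the repair interval needs to shrink with it.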