Check what your compactionthroughput is set to, as it will impact the validation compactions. Also, what kind of disks does the DR node have? The validation compaction sizes are likely fine. I'm not sure of the exact details, but it's normal to expect very large validations.
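To get a feel for how much the throttle matters, here is a rough back-of-the-envelope sketch. It assumes a validation reads its whole "total" size at exactly the configured compaction throughput, and it treats the setting as MiB/s; the three throughput values are hypothetical, not your actual configuration. The current value can be read with `nodetool getcompactionthroughput` and changed at runtime with `nodetool setcompactionthroughput`.

```python
# Rough estimate of how long one validation takes when it is limited by
# the compaction throughput setting. The throughput values are assumptions
# chosen for illustration, not measurements from the cluster in question.

MIB_PER_TIB = 1024 * 1024


def validation_hours(total_tib: float, throughput_mib_s: float) -> float:
    """Hours needed to read total_tib at a sustained throughput_mib_s."""
    if throughput_mib_s <= 0:
        # In Cassandra a throughput of 0 disables throttling entirely,
        # so there is no meaningful estimate to compute here.
        raise ValueError("throughput must be positive")
    seconds = total_tib * MIB_PER_TIB / throughput_mib_s
    return seconds / 3600


if __name__ == "__main__":
    total_tib = 2.67  # size of the largest validations in the quoted output
    for throughput in (16, 64, 200):  # hypothetical settings, in MiB/s
        hours = validation_hours(total_tib, throughput)
        print(f"{throughput:>4} MiB/s -> {hours:.1f} h")
```

This ignores disk limits and several validations competing for the same disks, so real times may well be longer. It only shows that with a low throttle, a multi-TiB validation can plausibly take many hours.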
Rebuilding would not be an ideal mechanism for repairing: it would likely be slower and chew up a lot of disk space. It's also not guaranteed to give you data that will be consistent with the other DC, as replicas will only be streamed from one node.

I think you're better off setting up regular backups and, if you really need them, commitlog backups. The storage would be cheaper and more reliable, and less impactful on your production DC. Restoring will also be a lot easier and faster, as restoring from a single-node DC will be network bottlenecked. There are various tools around that do this for you, such as medusa or tablesnap.

raft.so - Cassandra consulting, support, managed services

On Mon., 29 Mar. 2021, 20:47 Lapo Luchini, <l...@lapo.it> wrote:

> Hi all,
> I have a 6 nodes production cluster with 1.5 TiB load (RF=3) and a
> single-node DC dedicated as a "remote disaster recovery copy" 2.7 TiB.
>
> Doing repairs only on the production cluster takes a semi-decent time
> (24h for the biggest keyspace, which takes 90% of the space), but
> doing repair across the two DCs takes forever, and segments often fail
> even if I increased Reaper segment time limit to 2h.
>
> In trying to debug the issue, I noticed that "compactionstats -H" on the
> DR node shows huge (and very very slow) validations:
>
> compaction   completed    total      unit    progress
> Validation   2.78 GiB     8.11 GiB   bytes   34.33%
> Validation   0 bytes      2.67 TiB   bytes   0.00%
> Validation   1.7 TiB      2.43 TiB   bytes   69.75%
> Validation   124.26 GiB   2.67 TiB   bytes   4.55%
> Validation   536.67 GiB   2.67 TiB   bytes   19.63%
>
> Such validations take a few hours to complete, and as far as I
> understood segment repair always fails on the first try due to those, and
> only has success after a few tries when the original validation executed
> in the first try has ended.
>
> My question is this: is it normal to have to validate all of the
> keyspace content on each segment's validation?
> Is the DB in a "strange" state?
> Would it be useful to issue a "rebuild" on that node, in order to send
> all missing data anyways, and thus skipping the lengthy validations?
>
> thanks!
>
> --
> Lapo Luchini
> l...@lapo.it