Re: Repair on a slow node (or is it?)

Kane Wilson Mon, 29 Mar 2021 03:32:43 -0700

Check what your compactionthroughput is set to, as it throttles
validation compactions. Also, what kind of disks does the DR node have? The
validation compaction sizes are likely fine; I'm not sure of the exact
details, but very large validations are normal.
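
As a rough illustration of why the throughput setting matters, here is a
minimal sketch. It assumes a validation is limited only by the compaction
throughput cap and runs at that cap the whole time, which ignores disk and
CPU limits. The function name and the caps tried are just for illustration;
16 is what I believe the default cap to be, and the real value can be read
with "nodetool getcompactionthroughput".

```python
# Back-of-envelope estimate only: assumes the validation is throttled to the
# compaction throughput cap and sustains exactly that rate from start to end.
MIB_PER_TIB = 1024 * 1024


def validation_hours(size_tib: float, throughput_mib_s: float) -> float:
    """Hours needed to read size_tib of data at throughput_mib_s (MiB/s)."""
    if throughput_mib_s <= 0:
        # In Cassandra a cap of 0 means "unthrottled", so no estimate applies.
        raise ValueError("throughput must be positive")
    seconds = size_tib * MIB_PER_TIB / throughput_mib_s
    return seconds / 3600


if __name__ == "__main__":
    # 2.67 TiB is the validation size reported by compactionstats below.
    for cap in (16, 64, 200):
        print(f"{cap:>3} MiB/s -> {validation_hours(2.67, cap):.1f} h")
```

Under those assumptions a single 2.67 TiB validation at a 16 MiB/s cap needs
roughly two days, far longer than a 2h segment timeout, so raising the cap
with "nodetool setcompactionthroughput" is worth trying if the disks can
keep up.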


Rebuilding would not be an ideal mechanism for repairing, and would likely
be slower and chew up a lot of disk space. It's also not guaranteed to give
you data that will be consistent with the other DC, as replicas will only
be streamed from one node.

I think you're better off setting up regular backups and, if you really
need them, commitlog backups. The storage would be cheaper and more
reliable, and it would have less impact on your production DC. Restoring
will also be a lot easier and faster, as restoring from a single-node DC
will be network bottlenecked. There are various tools around that do this
for you, such as medusa or tablesnap.


raft.so - Cassandra consulting, support, managed services

On Mon., 29 Mar. 2021, 20:47 Lapo Luchini, <l...@lapo.it> wrote:

> Hi all,
>      I have a 6-node production cluster with 1.5 TiB load (RF=3) and a
> single-node DC dedicated as a "remote disaster recovery copy" (2.7 TiB).
>
> Doing repairs only on the production cluster takes a semi-decent time
> (24h for the biggest keyspace, which takes 90% of the space), but
> doing repair across the two DCs takes forever, and segments often fail
> even though I increased the Reaper segment time limit to 2h.
>
> In trying to debug the issue, I noticed that "compactionstats -H" on the
> DR node shows huge (and very very slow) validations:
>
> compaction  completed   total     unit   progress
> Validation  2.78 GiB    8.11 GiB  bytes  34.33%
> Validation  0 bytes     2.67 TiB  bytes  0.00%
> Validation  1.7 TiB     2.43 TiB  bytes  69.75%
> Validation  124.26 GiB  2.67 TiB  bytes  4.55%
> Validation  536.67 GiB  2.67 TiB  bytes  19.63%
>
> Such validations take a few hours to complete, and as far as I
> understand, segment repair always fails on the first try due to those,
> and only succeeds after a few tries, once the validation started in the
> first try has ended.
>
> My question is this: is it normal to have to validate all of the
> keyspace content on each segment's validation?
> Is the DB in a "strange" state?
> Would it be useful to issue a "rebuild" on that node, in order to send
> all missing data anyway, thus skipping the lengthy validations?
>
> thanks!
>
> --
> Lapo Luchini
> l...@lapo.it
>
>
