A full repair running for an entire week sounds excessively long. Even
with 1 TB of data per node, one week means the repair is progressing at
less than 2 MB/s, which is very slow. Perhaps you should focus on finding
the bottleneck that limits the full repair speed and work on that instead.
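For reference, the back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a rough sketch: it assumes 1 TB means 10^12 bytes and that the repair runs continuously for all seven days, and the function name is made up for illustration.

```python
def repair_throughput_mb_per_s(data_bytes: float, duration_days: float) -> float:
    """Average throughput in MB/s (1 MB = 10**6 bytes) for a repair that
    covers data_bytes in duration_days of continuous running."""
    seconds = duration_days * 24 * 60 * 60
    return data_bytes / seconds / 1e6


# Assumed figures from the estimate: 1 TB per node, repaired over 7 days.
rate = repair_throughput_mb_per_s(1e12, 7)
print(f"{rate:.2f} MB/s")  # prints "1.65 MB/s"
```

Using binary units (1 TiB) instead gives roughly 1.8 MB/s, so the conclusion of "less than 2 MB/s" holds either way.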
On 03/02/2024 16:18, Sebastian Marsching wrote:
Hi,
2. use an orchestration tool, such as Cassandra Reaper, to take care
of that for you. You will still need to monitor and alert to ensure the
repairs are run successfully, but fixing a stuck or failed repair is
not very time sensitive; you can usually leave it until Monday morning
if it happens on a Friday night.
Does anyone know how such a schedule can be created in Cassandra Reaper?
I recently learned the hard way that running both a full and an
incremental repair for the same keyspace and table in parallel is not
a good idea (it caused a very unpleasant overload situation on one of
our clusters).
At the moment, we have one schedule for the full repairs (every 90
days) and another schedule for the incremental repairs (daily). But as
full repairs take much longer than a day (about a week, in our case),
the two schedules collide: Cassandra Reaper starts an incremental
repair while the full repair is still in progress.
Does anyone know how to avoid this? Optimally, the full repair would
be paused (no new segments started) for the duration of the
incremental repair. The second best option would be inhibiting the
incremental repair while a full repair is in progress.
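To make the second option concrete, here is a minimal sketch of the check I have in mind. It is not Cassandra Reaper's API: the RepairRun record, its fields, and the "RUNNING" state name are all hypothetical stand-ins for whatever the scheduler actually knows about its runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairRun:
    # Hypothetical record, not a Cassandra Reaper type.
    keyspace: str
    tables: frozenset  # empty set means "all tables in the keyspace"
    incremental: bool
    state: str  # assumed state names, e.g. "RUNNING", "PAUSED", "DONE"


def may_start_incremental(keyspace, tables, runs):
    """Return False if a running full repair overlaps the given
    keyspace/tables, True otherwise."""
    wanted = frozenset(tables)
    for run in runs:
        # Only full repairs that are actually running block the start.
        if run.incremental or run.state != "RUNNING":
            continue
        if run.keyspace != keyspace:
            continue
        # An empty table set on either side covers the whole keyspace.
        if not wanted or not run.tables or wanted & run.tables:
            return False
    return True


runs = [RepairRun("ks1", frozenset({"t1"}), incremental=False, state="RUNNING")]
print(may_start_incremental("ks1", {"t1"}, runs))  # False
print(may_start_incremental("ks1", {"t2"}, runs))  # True
```

A paused full repair deliberately does not block here, so the same check would also fit the first option, where the full repair is paused for the duration of the incremental one.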
Best regards,
Sebastian