Thanks for the response.

We do break up repairs between tables, and we try our best to avoid any
overlap between repair runs. Each repair run has 10,000 segments (a
purely arbitrary number that seemed to help at the time). Some runs have
an intensity of 0.4, some as low as 0.05.
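
As I understand it, each Reaper segment ultimately boils down to a
subrange repair of a single table, i.e. roughly the equivalent of
something like this (the token values and keyspace/table names below
are just placeholders, not our actual setup):

    # one segment ~ one subrange, parallel repair of one table
    nodetool repair -par -st -9223372036854775808 -et -9200000000000000000 my_keyspace my_table

Reaper then schedules thousands of these per table and, as far as I can
tell, paces them according to the intensity setting.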

Still, sometimes one particular app (which does a lot of
read/modify/write batches at QUORUM) gets slowed down to the point that
we have to stop the repair run.

More annoyingly, as I said, for the past two to three weeks runs seem
to stop progressing after some time. Every time I restart Reaper it
starts repairing correctly again, up until it gets stuck once more. I
have no idea why that happens now, but it means I have to babysit
Reaper, and it's becoming annoying.

Thanks for the suggestion about incremental repairs. It would probably
be a good thing, but I think it's a little challenging to set up. Right
now a full repair of all keyspaces (via nodetool repair) would take a
lot of time, probably five days or more, and we were never able to run
one to completion. I'm not sure it's a good idea to disable
autocompaction for that long.

But maybe I'm wrong. Is it possible to use incremental repairs on some
column families only?
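
If it is, I'm guessing the per-table command on 2.1 would look
something like the following (keyspace/table names are placeholders,
and please correct me if I have the flags wrong):

    # incremental, parallel repair of a single table on 2.1
    nodetool repair -inc -par my_keyspace my_table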


On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:
> Hi Vincent,
>
> most people handle repair with:
> - pain (running nodetool commands by hand)
> - cassandra_range_repair:
>   https://github.com/BrianGallew/cassandra_range_repair
> - Spotify Reaper
> - and the OpsCenter repair service for DSE users
>
> Reaper is a good option, I think, and you should stick to it. If it
> cannot do the job here, then no other tool will.
>
> You have several options from here:
>  * Try to break up your repair table by table and see which ones
>    actually get stuck
>  * Check your logs for any repair/streaming error
>  * Avoid repairing everything:
>    * you may have expendable tables
>    * you may have TTL-only tables with no deletes, accessed at
>      QUORUM CL only
>  * You can try to relieve repair pressure in Reaper by lowering repair
>    intensity (on the tables that get stuck)
>  * You can try adding steps to your repair process by putting a higher
>    segment count in reaper (on the tables that get stuck)
>  * And lastly, you can turn to incremental repair. As you're familiar
>    with Reaper already, you might want to take a look at our Reaper
>    fork that handles incremental repair:
>    https://github.com/thelastpickle/cassandra-reaper
>    If you go down that way, make sure you first mark all sstables as
>    repaired before you run your first incremental repair, otherwise
>    you'll end up in anticompaction hell (bad bad place):
>    https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>    Even if people say that's not necessary anymore, it'll save you
>    from a very bad first experience with incremental repair.
>    Furthermore, make sure you run repair daily after your first
>    incremental repair run, in order to keep each repair small.
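>
> For what it's worth, that migration basically amounts to marking
> every existing sstable as repaired on each node. A rough sketch, per
> node and per table (paths and keyspace/table names are placeholders,
> adapt them to your layout):
>
>     # run one last full repair while autocompaction is disabled
>     nodetool disableautocompaction my_keyspace my_table
>     nodetool repair my_keyspace my_table
>     # stop the node, then mark its sstables as repaired
>     find /var/lib/cassandra/data/my_keyspace/my_table-*/ -iname "*Data.db" > sstables.txt
>     sstablerepairedset --really-set --is-repaired -f sstables.txt
>     # restart the node and re-enable autocompaction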
>
> Cheers,
>
>
> On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann
> <m...@vrischmann.me> wrote:
>> Hi,
>>
>> we have two Cassandra 2.1.15 clusters at work and are having some
>> trouble with repairs.
>>
>> Each cluster has 9 nodes, and the amount of data is not gigantic,
>> but some column families have 300+ GB of data.
>> We tried to use `nodetool repair` for these tables, but when we
>> tested it, it put too much load on the whole cluster and impacted
>> our production apps.
>>
>> Next we saw https://github.com/spotify/cassandra-reaper, tried it,
>> and had some success until recently. For the past two to three weeks
>> it has never completed a repair run, somehow deadlocking itself.
>>
>> I know DSE includes a repair service, but I'm wondering: how do
>> other Cassandra users manage repairs?
>>
>> Vincent.
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com

