Hi Sebastian,
It's better to walk down a path on which others have walked before you
and had great success than down a path that nobody has ever walked. For
the former, you know it's relatively safe and that it works. The same
can hardly be said for the latter.
You said it takes a week to run the full repair for your entire cluster,
not each node. Depending on the number of nodes in your cluster, each
node should take significantly less time than that unless you have RF
set to the total number of nodes. Keep in mind that you only need to
disable the auto-compaction for the duration of a full repair on each
node, not the whole cluster.
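In practice, that can be as simple as the following on each node in
turn (a rough sketch; <keyspace> is a placeholder, and note that
"nodetool disableautocompaction" and "nodetool enableautocompaction"
can also be limited to specific keyspaces and tables):

    nodetool disableautocompaction
    nodetool repair --full <keyspace>
    nodetool enableautocompaction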
Now, you asked: how do I know whether that is going to be an issue or
not? That depends on a few factors, such as:
* how long does it take for each node to complete a full repair for that
node
* how many SSTables currently exist on each node (try "find
/var/lib/cassandra/data -name '*-Data.db' | wc -l")
* how frequently is the memtable getting flushed on each node
* what's the number of open file descriptors limit (see "cat
/proc/[PID]/limits" and "sysctl fs.nr_open")
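To put concrete numbers on these, something along these lines should do
(a sketch; it assumes the default data and log directories, and [PID]
is the Cassandra process ID; the exact log message may vary between
versions):

    # existing SSTables on this node
    find /var/lib/cassandra/data -name '*-Data.db' | wc -l
    # per-process and system-wide open FD limits
    grep 'open files' /proc/[PID]/limits
    sysctl fs.nr_open
    # rough idea of the memtable flush frequency
    grep -c 'Enqueuing flush' /var/log/cassandra/system.log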
If the total number of SSTables (existing, plus the number of memtable
flushes that happen while auto-compaction is turned off) is going to be
significantly less than half of the open FD limit, you'll have nothing
to worry about. Otherwise, you may want to consider temporarily
increasing the open FD limit, reducing the memtable flush frequency
(e.g. by increasing the memtable size or reducing the number of write
requests), reducing the existing number of SSTables (e.g. by
compacting), or just taking the risk and betting that Cassandra is not
going to open all the SSTables at the same time (not recommended).
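To illustrate with completely made-up numbers:

    existing=100000        # SSTables currently on the node
    flushes_per_hour=10    # memtable flushes while auto-compaction is off
    repair_hours=48        # time for this node's full repair
    fd_limit=1048576       # open FD limit of the Cassandra process
    total=$(( existing + flushes_per_hour * repair_hours ))
    echo "roughly $(( total * 2 )) FDs needed, limit is $fd_limit"

The factor of 2 is explained below.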
You may be wondering: why only half of the open FD limit? That's
because Cassandra usually keeps both the *-Index.db and the *-Data.db
files open when an SSTable is in use.
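If you want to double-check the actual number of files the process has
open, "ls /proc/[PID]/fd | wc -l" will tell you.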
I hope that helps.
Regards,
Bowen
On 23/11/2023 23:30, Sebastian Marsching wrote:
Hi,
we are currently in the process of migrating from C* 3.11 to C* 4.1
and we want to start using incremental repairs after the upgrade has
been completed. It seems like all the really bad bugs that made using
incremental repairs dangerous in C* 3.x have been fixed in 4.x, and
for our specific workload, incremental repairs should offer a
significant performance improvement.
Therefore, I am currently devising a plan for how we could migrate to
using incremental repairs. I am aware of the guide from DataStax
(https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsRepairNodesMigration.html),
but this guide is quite old and was written with C* 3.0 in mind, so I
am not sure whether this still fully applies to C* 4.x.
In addition to that, I am not sure whether this approach fits our
workload. In particular, I am wary about disabling autocompaction for
an extended period of time (if you are interested in the reasons why,
they are at the end of this e-mail).
Therefore, I am wondering whether a slightly different process might
work better for us:
1. Run a full repair (we periodically run those anyway).
2. Mark all SSTables as repaired (see the sketch after this list), even
though they will include data that has not been repaired yet because it
was added while the repair process was running.
3. Run another full repair.
4. Start using incremental repairs (and the occasional full repair in
order to handle bit rot etc.).
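For step 2, my understanding is that the usual way (as in the DataStax
guide) is the offline sstablerepairedset tool, run on each node while
Cassandra is stopped, along these lines (a sketch; the path and
<keyspace> are placeholders):

    # with Cassandra stopped on this node
    find /var/lib/cassandra/data/<keyspace> -name '*-Data.db' > sstables.txt
    sstablerepairedset --really-set --is-repaired -f sstables.txt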
If I understood the interactions between full repairs and incremental
repairs correctly, step 3 should repair potential inconsistencies in
the SSTables that were marked as repaired in step 2 while avoiding the
problem of overstreaming that would happen when only marking those
SSTables as repaired that already existed before step 1.
Does anyone see a flaw in this concept, or have experience with a
similar scenario (migrating to incremental repairs in an environment
with high-density nodes, where a single table contains most of the data)?
I am also interested in hearing about potential problems other C*
users experienced when migrating to incremental repairs, so that we
get a better idea of what to expect.
Thanks,
Sebastian
Here is the explanation of why I am being cautious:
More than 95 percent of our data is stored in a single table, and we
use high density nodes (storing about 3 TB of data per node). This
means that a full repair for the whole cluster takes about a week.
The reason for this layout is that most of our data is “cold”, meaning
that it is written once, never updated, and rarely deleted or read.
However, new data is added continuously, so disabling autocompaction
for the duration of a full repair would lead to a high number of small
SSTables accumulating over the course of the week, and I am not sure
how well the cluster would handle such a situation (and the increased
load when autocompaction is enabled again).