> Is this a feature or a bug?

Neither, really. Repair doesn't do any collection of gcable tombstones,
and it would be really hard to change that (besides, it's not its job).
So if, when you run repair, there are sstables with tombstones that
could be collected but have not been yet, then yes, they will be
streamed. Now, the theory is that compaction will run often enough that
gcable tombstones will be collected in a reasonably timely fashion, so
you will never have lots of such tombstones in general (making the fact
that repair streams them largely irrelevant). That being said, in
practice, I don't doubt that there are a few scenarios like your own
where this can still lead to doing too much useless work.

I believe the main problem is that size-tiered compaction has a
tendency not to compact the largest sstables very often, meaning that
you could have large sstables with mostly gcable tombstones sitting
around. In the upcoming Cassandra 1.2,
https://issues.apache.org/jira/browse/CASSANDRA-3442 will fix that.
Until then, if you are not afraid of a little bit of scripting, one
option could be to run a small script before each repair that checks
the creation time of your sstables. If an sstable is old enough (for
some value of "old enough" that depends on the TTL you use on all your
columns), you may want to force a compaction of that sstable (using the
JMX call forceUserDefinedCompaction()). The goal is to get rid of as
many outdated tombstones as possible before running the repair (you
could alternatively run a major compaction prior to the repair, but
major compactions have a lot of nasty effects, so I wouldn't recommend
that a priori).
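
A minimal sketch of the first half of such a script, in Python, might
look like the following. It only lists candidate sstables; it does not
make the JMX call itself. The function name old_sstables, the data
directory path, and the age threshold are all hypothetical, and it
assumes that the modification time of a *-Data.db file is a good enough
stand-in for the sstable's creation time (sstables are written once and
not modified afterwards). Picking the threshold is left to you; the TTL
plus gc_grace_seconds is probably a reasonable starting point, but that
is an assumption rather than something verified here.

```python
import os
import time


def old_sstables(data_dir, max_age_seconds, now=None):
    """Return the *-Data.db files under data_dir older than max_age_seconds.

    Age is measured from the file's modification time, used here as an
    approximation of the sstable's creation time (an assumption).
    """
    now = time.time() if now is None else now
    candidates = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            # Only the data component identifies an sstable; skip the
            # index, filter and statistics files that sit next to it.
            if not name.endswith("-Data.db"):
                continue
            path = os.path.join(root, name)
            age = now - os.path.getmtime(path)
            if age > max_age_seconds:
                candidates.append(path)
    return sorted(candidates)


# Example usage (path and threshold are placeholders, adjust to your setup):
# for path in old_sstables("/var/lib/cassandra/data", 20 * 24 * 3600):
#     print(path)
```

Each path printed would then be a candidate to pass to
forceUserDefinedCompaction() through whatever JMX client you already
use, before starting the repair.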

--
Sylvain
