> IIRC, tombstone timestamps are written by the server, at compaction
> time. Therefore if you have RF=X, you have X different timestamps
> relative to GCGraceSeconds. I believe there was another thread about
> two weeks ago in which Sylvain detailed the problems with what you are
> proposing, when someone else asked approximately the same question.
>
Oh yes, I forgot about the thread. I assume you are talking about:
http://grokbase.com/t/cassandra/user/12ab6pbs5n/unnecessary-tombstones-transmission-during-repair-process

I think there are several related issues here:

1) Repair uses the local timestamp of DeletedColumns for the Merkle tree
calculation. This is what the other thread was about.
Alexey claims that this was fixed by another commit:
https://issues.apache.org/jira/secure/attachment/12544204/CASSANDRA-4561-CS.patch
But honestly, I don't see how that patch solves it. I do see how Alexey's
patch from a few messages earlier would solve it: by overriding the
updateDigest method in DeletedColumn, as sketched below.
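
For illustration, here is a minimal standalone sketch of that idea. The
class names mirror org.apache.cassandra.db.Column and DeletedColumn, but
the fields and signatures are simplified stand-ins of my own, not the
real API:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.security.MessageDigest;

    // Simplified stand-ins for the real column classes.
    class Column {
        final byte[] name;
        final byte[] value;   // for a tombstone, holds the node-local deletion time
        final long timestamp; // client-supplied write timestamp

        Column(byte[] name, byte[] value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }

        // Default digest covers name, value and timestamp.
        void updateDigest(MessageDigest digest) throws IOException {
            digest.update(name);
            digest.update(value);
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeLong(timestamp);
            digest.update(buffer.toByteArray());
        }
    }

    class DeletedColumn extends Column {
        DeletedColumn(byte[] name, byte[] localDeletionTime, long timestamp) {
            super(name, localDeletionTime, timestamp);
        }

        // Override: skip the value, because it encodes the node-local
        // deletion time and therefore differs between replicas for the
        // very same delete.
        @Override
        void updateDigest(MessageDigest digest) throws IOException {
            digest.update(name);
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new DataOutputStream(buffer).writeLong(timestamp);
            digest.update(buffer.toByteArray());
        }
    }

With that override, two replicas holding the same delete (same name, same
client timestamp) produce the same digest even if they wrote the tombstone
at different local times.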

2) ExpiringColumns should not be used for the Merkle tree calculation once
they have timed out.
I checked LazilyCompactedRow and saw that it does not exclude any timed-out
columns: it loops over all columns and calls updateDigest on each of them,
unconditionally. Imho ExpiringColumn.updateDigest() should first check its
own isMarkedForDelete() before touching the digest (we cannot simply call
isMarkedForDelete() from LazilyCompactedRow, because we do not want that
behaviour for DeletedColumns); see the sketch below.
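
Continuing the toy model from above (the nowInSec parameter is my own
simplification; iirc the real isMarkedForDelete() reads the clock itself),
the check could look like this:

    class ExpiringColumn extends Column {
        final int localExpirationTime; // seconds since epoch, derived from the client TTL

        ExpiringColumn(byte[] name, byte[] value, long timestamp, int localExpirationTime) {
            super(name, value, timestamp);
            this.localExpirationTime = localExpirationTime;
        }

        boolean isMarkedForDelete(int nowInSec) {
            return nowInSec >= localExpirationTime;
        }

        // Contribute nothing to the digest once the TTL has passed, so a
        // column that has expired on one node but has not yet been
        // compacted away on another no longer perturbs the Merkle tree.
        @Override
        void updateDigest(MessageDigest digest) throws IOException {
            if (isMarkedForDelete((int) (System.currentTimeMillis() / 1000)))
                return;
            super.updateDigest(digest);
        }
    }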

3) Cassandra should not create tombstones for expiring columns.
I am not 100% sure, but it looks to me like Cassandra creates tombstones
for expired ExpiringColumns. This makes me wonder whether we could delete
expired columns directly. The digests of an ExpiringColumn and a
DeletedColumn can never match, due to their different updateDigest
implementations, so there will always be a repair whenever compactions are
not synchronous across the nodes.
Imho it should be valid to delete ExpiringColumns directly, because the TTL
is supplied by the client and should therefore pass on all nodes at the
same time; a sketch of what I mean follows below.
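
In the same toy model, the proposal boils down to something like the
following (purgeExpired is a made-up name, and real compaction streams
columns instead of building lists, but the effect is the same):

    import java.util.ArrayList;
    import java.util.List;

    class CompactionSketch {
        // Proposed behaviour: during compaction, drop expired
        // ExpiringColumns outright instead of converting them into
        // DeletedColumn tombstones. This relies on the assumption above:
        // the TTL comes from the client, so it passes at the same
        // wall-clock instant on every replica.
        static List<Column> purgeExpired(List<Column> input, int nowInSec) {
            List<Column> survivors = new ArrayList<Column>();
            for (Column c : input) {
                if (c instanceof ExpiringColumn
                        && ((ExpiringColumn) c).isMarkedForDelete(nowInSec))
                    continue; // no tombstone written, the column just disappears
                survivors.add(c);
            }
            return survivors;
        }
    }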

Taken together, these changes should reduce over-repair.


> Merkle trees are an optimization; what they trade for this
> optimization is over-repair.
>
> (FWIW, I agree that, if possible, this particular case of over-repair
> would be nice to eliminate.)
>
Of course; better to over-repair than to corrupt something.
