> I see. It explains why I get 85G + 85G instead of 90G. But after next
> repair I have six extra files 75G each,
> how is it possible?

Maybe you've run repair on other nodes? Basically repair is a fairly
blind process. If it considers that a given range (and by range I
mean here the ones that repair hashes, so they're supposed to be
small ranges) is inconsistent between two peers, it doesn't know
which peer is up to date (and in fact it cannot know: it is possible
that neither node is more up to date than the other over the whole
range, even if said range is small, at least in theory).
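
To make that concrete, here's a minimal sketch (plain Java, not
Cassandra's actual code; the Range type and the hash maps are
invented for illustration) of why a mismatching Merkle leaf leads to
streaming in both directions, which is how both replicas can end up
growing:

    import java.util.*;

    class RepairSketch {
        // Hypothetical type, for illustration only.
        record Range(long left, long right) {}

        // Compare leaf hashes of two Merkle trees built over the same
        // ranges, and collect the ranges to stream.
        static void syncDifferences(Map<Range, byte[]> local,
                                    Map<Range, byte[]> remote,
                                    List<Range> streamToRemote,
                                    List<Range> streamFromRemote) {
            for (Map.Entry<Range, byte[]> e : local.entrySet()) {
                // A hash mismatch only says "these peers disagree";
                // repair cannot tell which side is stale (possibly
                // neither is, over the whole range), so it streams the
                // range in both directions and lets the usual
                // timestamp-based reconciliation merge the data.
                if (!Arrays.equals(e.getValue(), remote.get(e.getKey()))) {
                    streamToRemote.add(e.getKey());
                    streamFromRemote.add(e.getKey());
                }
            }
        }
    }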

> It looks like repair is done per sstable, not CF. Is it possible?

No, repair is done on the CF, not on individual sstables.


> Does Merkle tree calculation algorithm use sstables flushed on hard drive or 
> it uses mem tables also?

It triggers a flush, waits for it, and then uses the on-disk
sstables. In theory, since the order to flush is sent to each replica
at the same time, all the replicas will trigger the flush within a
very short interval, and thus consider data sets that differ only by
what was written during that short interval. So in a cluster with a
high write load, it is expected that a bit of the inconsistency is
due to that interval, but it should be relatively negligible. That's
the theory at least. Clearly your case (ordered partitioner with
intensive updates covering the whole range) does maximize that
inconsistency, but still, it shouldn't be that dramatic.
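
As a rough sketch of that validation step (again illustrative Java,
not Cassandra's actual code; the FlushedReader interface is
invented), the tree is built only from what made it to disk by flush
time:

    import java.security.MessageDigest;
    import java.util.*;

    class ValidationSketch {
        // Invented stand-in for reading flushed, on-disk data.
        interface FlushedReader {
            Iterator<byte[]> rows(long left, long right);
        }

        // Hash a range using only on-disk sstables: the memtable has
        // just been flushed, so writes arriving after the flush order
        // are invisible to this run, and replicas that flushed at
        // slightly different moments can show small spurious
        // differences.
        static byte[] hashRange(List<FlushedReader> onDisk,
                                long left, long right) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (FlushedReader r : onDisk) {
                Iterator<byte[]> it = r.rows(left, right);
                while (it.hasNext()) {
                    md.update(it.next());
                }
            }
            return md.digest();
        }
    }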

However, your use of the ordered partitioner might not be
insignificant, as it's much less used and repair does have a few bits
specific to it. Do you mind opening a ticket on JIRA with a summary
of your configuration/problem? I'll see if I can spot something wrong
related to the ordered partitioner, and in any case that'll make it
simpler to track down what's wrong.

--
Sylvain
