Hi,
We have a Cassandra cluster with 2 DCs, where each DC has more than 30 nodes.
From time to time we see problems with read repair, but I am stuck with the 
analysis.
The faults follow this pattern:

  1.  INSERT with Local Quorum (2 out of 3)
  2.  Wait 0.5 to 1 second
  3.  READ with Local Quorum (2 out of 3)
     *   Triggers a read repair
  4.  Then we do an UPDATE …
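
To make step 3 concrete, here is a minimal Python sketch of how I picture a quorum read ending in a read repair. It is not Cassandra code: the replica names A, B, C, the stored values, and the helper functions are hypothetical, and it assumes the third replica had not applied the write when the read arrived.

```python
import hashlib
from itertools import combinations

def digest(value):
    """Stand-in for the per-row digest a replica returns (None means no row yet)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()

def mismatching_read_sets(replicas, quorum):
    """Return every quorum-sized set of replicas whose digests disagree."""
    return [
        read_set
        for read_set in combinations(sorted(replicas), quorum)
        if len({digest(replicas[name]) for name in read_set}) > 1
    ]

# Step 1: the INSERT was acknowledged by A and B. Replica C is ASSUMED to have
# missed or not yet applied the write (hypothetical state, for illustration).
replicas = {"A": "row-v1", "B": "row-v1", "C": None}

# Step 3: a quorum read contacts 2 of the 3 replicas.
print(mismatching_read_sets(replicas, quorum=2))  # [('A', 'C'), ('B', 'C')]
```

In this model, two of the three possible read sets include the lagging replica, so a digest mismatch (and therefore a read repair) would only be expected if one replica really is behind at read time.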

The replication factor is 3.
As I understand it, in (1) the data is stored on at least 2 of the 3 replicas, and I would 
be surprised if it had not also reached the 3rd node within 0.5 seconds.
So how come the read in (3) can’t get a proper response from 2 out of 3?
Some say the problem started occurring when we added DC2, but I can’t 
understand how that could be, as our queries use Local Quorum and should involve only 
DC1.
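
The arithmetic behind my reasoning, as a small Python sketch. The function names are my own, and the replication factor for DC2 is an assumption (only RF 3 is stated above). As far as I understand, a Local Quorum is computed from the local DC's replication factor only, so adding DC2 should not change how many DC1 replicas must answer.

```python
def quorum(rf: int) -> int:
    """Quorum size for a given replication factor: floor(rf / 2) + 1."""
    return rf // 2 + 1

def read_sees_write(write_acks: int, read_replies: int, rf: int) -> bool:
    """True when any read set must overlap any write set (W + R > RF)."""
    return write_acks + read_replies > rf

# RF per DC. DC1 = 3 is from the description above; DC2 = 3 is an ASSUMPTION.
replication = {"DC1": 3, "DC2": 3}

local_rf = replication["DC1"]
q = quorum(local_rf)
print(q, read_sees_write(q, q, local_rf))  # 2 True
```

So with 2 write acknowledgements and 2 read replies out of 3, the read should always include at least one replica that has the new data.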

How can I debug this fault?
How can I track whether the data has reached all 3 nodes?

All ideas are welcome
-Tobias

