The API page is incorrect. Cassandra only contacts enough nodes to satisfy the requested CL. https://issues.apache.org/jira/browse/CASSANDRA-4705 and https://issues.apache.org/jira/browse/CASSANDRA-2540 are relevant to the fragility that can result, as you say. (Although, unless you are doing zero read repairs, I would expect the dynamic snitch to steer requests away from the unresponsive node a lot faster than 30s.)
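For concreteness, here is a minimal sketch of the CL=QUORUM math in
question. This is illustrative only -- not the actual ReadCallback
code, and the class and method names are made up:

    // A minimal sketch, assuming the usual majority definition;
    // this is not Cassandra's actual source.
    public class QuorumSketch {
        // Number of replicas the coordinator blocks for at CL=QUORUM:
        // a simple majority of the replication factor.
        static int blockFor(int replicationFactor) {
            return replicationFactor / 2 + 1; // RF=3 -> 2, RF=5 -> 3
        }

        public static void main(String[] args) {
            // With RF=3 the coordinator contacts (and blocks for) only
            // 2 replicas, not all 3 as the wiki currently claims.
            System.out.println(blockFor(3)); // prints 2
        }
    }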
On Thu, Oct 4, 2012 at 1:25 PM, Kirk True <[email protected]> wrote:
> Hi all,
>
> Test scenario:
>
> 4 nodes (.1, .2, .3, .4)
> RF=3
> CL=QUORUM
> Cassandra 1.1.2
>
> I noticed that in ReadCallback's constructor, it determines a
> 'blockfor' value of 2 for RF=3, CL=QUORUM.
>
> According to the API page on the wiki [1], for reads at CL=QUORUM:
>
>     Will query *all* replicas and return the record with the most
>     recent timestamp once it has at least a majority of replicas
>     (N / 2 + 1) reported.
>
> However, in ReadCallback's constructor, it determines blockfor to be
> 2, then calls filterEndpoints. filterEndpoints is given a list of the
> three replicas, but at the very end of the method, the endpoint list
> is trimmed to only two replicas. Those two replicas are then used in
> StorageProxy to execute the read/digest calls. So it ends up as 2
> nodes, not all three as stated on the wiki.
>
> In my test case, I kill a node and then immediately issue a query for
> a key that has a replica on the downed node. The live nodes in the
> system don't immediately know that the other node is down yet. Rather
> than contacting *all* nodes as the wiki states, the coordinator
> contacts only two -- one of which is the downed node. Since it blocks
> for two responses, and one of the contacted nodes is down, the query
> times out. Attempting the read again produces the same effect, even
> when trying different nodes as coordinators. I end up retrying a few
> times until the failure detectors on the live nodes realize that the
> node is down.
>
> So, the end result is that if a client attempts to read a row that
> has a replica on a newly downed node, it will time out repeatedly
> until the ~30-second failure detector window has passed -- even
> though there are enough live replicas to satisfy the request. We
> basically have a scenario wherein a value is not retrievable for
> upwards of 30 seconds. The percentage of keys that exhibit this
> possibility shrinks as the ring grows, but it's still non-zero.
>
> This doesn't seem right, and I'm sure I'm missing something.
>
> Thanks,
> Kirk
>
> [1] http://wiki.apache.org/cassandra/API

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
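To make the failure mode in Kirk's report concrete, here is a minimal,
self-contained sketch. filterEndpoints below is an illustrative
stand-in for the trimming he describes, not the actual Cassandra
implementation:

    import java.util.List;

    public class ReadTimeoutSketch {
        // Keep only 'blockfor' endpoints, discarding the extra live
        // replica (a simplification of the behavior described above).
        static List<String> filterEndpoints(List<String> replicas,
                                            int blockfor) {
            return replicas.subList(0, Math.min(blockfor, replicas.size()));
        }

        public static void main(String[] args) {
            List<String> replicas = List.of(".1", ".2", ".3"); // RF=3
            int blockfor = 2; // RF/2 + 1 with RF=3
            List<String> contacted = filterEndpoints(replicas, blockfor);
            // If ".2" just died but the failure detector hasn't marked
            // it down yet, and ".2" is among the contacted endpoints,
            // the coordinator blocks for 2 responses, receives only 1,
            // and the read times out -- even though two live replicas
            // (.1 and .3) exist.
            System.out.println(contacted); // [.1, .2]
        }
    }

Once the failure detector marks the node down (or read repair plus the
dynamic snitch steers requests away from it), the coordinator picks two
live replicas instead and the read succeeds.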
