[ https://issues.apache.org/jira/browse/CASSANDRA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032897#comment-13032897 ]

Peter Schuller commented on CASSANDRA-2643:
-------------------------------------------

I realized I failed to make possibly the most important point: that you can 
indeed get short reads, such that iteration will stop early. Consider 3 nodes, 
RF=3, QUORUM.

* Node A has columns [10,40]
* Node B has columns [10,40]
* Node C has tombstones for [10,20] (i.e. deletions of the columns that A/B 
have), but does NOT have anything for [21,40] because it was down when those 
were written

Now a client slices [10,1000] with count=11. The co-ordinating node will 
reconcile that; C's tombstones will override A/B (I'm assuming the tombstones 
are later than A+B's columns), but since C is lacking the "remainder" of the 
columns, you don't just get some columns at a lowered consistency level - you 
actually get a "short" result, and the application or high-level client will 
believe that the iteration is complete.
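
To make this concrete, here is a toy, pure-Python model of what I believe 
happens. This is NOT the actual read path; it is just my mental model of the 
replica-local SliceQueryFilter plus the co-ordinator's merge, so treat the 
names and the "tombstones don't count toward the limit" behaviour as my 
assumptions:

    # A column is (timestamp, value); value None means tombstone.

    def local_slice(columns, start, count):
        # Each replica applies the slice locally; tombstones are returned,
        # but only live columns count toward 'count' (my understanding).
        out, live = {}, 0
        for name in sorted(n for n in columns if n >= start):
            out[name] = columns[name]
            if columns[name][1] is not None:
                live += 1
            if live >= count:
                break
        return out

    def reconcile(responses):
        # Co-ordinator merge: highest timestamp wins per column.
        merged = {}
        for resp in responses:
            for name, col in resp.items():
                if name not in merged or col[0] > merged[name][0]:
                    merged[name] = col
        return merged

    node_a = dict((n, (1, 'v')) for n in range(10, 41))   # live columns 10..40
    node_b = dict(node_a)                                  # same as A
    node_c = dict((n, (2, None)) for n in range(10, 21))   # only tombstones 10..20

    count = 11
    responses = [local_slice(node, 10, count) for node in (node_a, node_b, node_c)]
    merged = reconcile(responses)
    live = [n for n in sorted(merged) if merged[n][1] is not None][:count]
    print(live)   # [] -- fewer than count=11, so the caller stops iterating,
                  # never seeing 21..40 even though A and B both have them.

The point being that the co-ordinator hands back fewer than count live columns 
even though plenty of live columns remain past the tombstoned range.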

This was the primary reason why I said in the OP that I believed "iterating 
over columns is impossible to do reliably with QUORUM". I somehow lost this 
when re-phrasing the JIRA post a couple of times.

Note: The short read case is not something I have tested and triggered, so it 
is based on extrapolation from my understanding of the code.



> read repair/reconciliation breaks slice based iteration at QUORUM
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-2643
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2643
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.5
>            Reporter: Peter Schuller
>            Priority: Critical
>         Attachments: slicetest.py
>
>
> In short, I believe iterating over columns is impossible to do reliably with 
> QUORUM due to the way reconciliation works.
> The problem is that the SliceQueryFilter executes locally when reading on 
> a node, but the slice limit does not seem to be taken into account when 
> doing reconciliation and/or read-repair (RowRepairResolver.resolveSuperset() 
> and ColumnFamily.resolve()).
> If a node slices and comes up with 100 columns, and another node slices and 
> comes up with 100 columns, some of which are unique to each side, 
> reconciliation results in > 100 columns in the result set. In this case the 
> effect is limited to "client gets more than asked for", but the columns still 
> accurately represent the range. This is easily triggered by my test-case.
> In addition to the client receiving "too many" columns, I believe some of 
> them will not satisfy the QUORUM consistency level, for the same reasons as 
> with deletions (see discussion below).
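> As a toy illustration of the over-delivery (not the real resolver code, and 
> using count=3 instead of 100 to keep it short):
>
>     count = 3
>     node_a = [1, 2, 3]        # first `count` column names node A returns
>     node_b = [1, 4, 5]        # first `count` column names node B returns
>     merged = sorted(set(node_a) | set(node_b))
>     print(merged)             # [1, 2, 3, 4, 5] -- asked for 3, got 5
>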
> Now, there *should* be a problem for tombstones as well, but it's more 
> subtle. Suppose A has:
>   1
>   2
>   3
>   4
>   5
>   6
> and B has:
>   1
>   del 2
>   del 3
>   del 4
>   5
>   6 
> If you now slice 1-6 with count=3, the tombstones from B will reconcile with 
> the corresponding columns from A - fine. So you end up getting 1, 5 and 6 
> back. This made it a bit
> difficult to trigger in a test case until I realized what was going on. At 
> first I was "hoping" to see a "short" iteration result, which would mean that 
> the process of iterating until you get a short result will cause spurious 
> "end of columns" and thus make it impossible to iterate correctly.
> So, because 5 and 6 exist (and if they didn't, you legitimately reached 
> end-of-columns), we do indeed get a result of size 3 which contains 1, 5 
> and 6. However, only node B would have contributed columns 5 and 6, so 
> there is actually no QUORUM consistency on the co-ordinating node with 
> respect to these columns. Even if nodes A and C also had 5 and 6, their 
> copies would not have been considered, because their local slices stop 
> after the first 3 live columns.
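> A toy run of this (again, not the real code; B's tombstones carry the newer 
> timestamp):
>
>     def local_slice(cols, count):
>         # Replica-local slice: tombstones are returned, but only live
>         # columns count toward 'count' (my understanding of the filter).
>         out, live = [], 0
>         for name, (ts, val) in sorted(cols.items()):
>             out.append((name, ts, val))
>             if val is not None:
>                 live += 1
>             if live >= count:
>                 break
>         return out
>
>     A = dict((n, (1, 'v')) for n in (1, 2, 3, 4, 5, 6))
>     B = {1: (1, 'v'), 2: (2, None), 3: (2, None), 4: (2, None),
>          5: (1, 'v'), 6: (1, 'v')}
>
>     merged = {}
>     for name, ts, val in local_slice(A, 3) + local_slice(B, 3):
>         if name not in merged or ts > merged[name][0]:   # newest wins
>             merged[name] = (ts, val)
>     print([n for n in sorted(merged) if merged[n][1] is not None])
>     # -> [1, 5, 6]; A's slice stopped at 1,2,3, so 5 and 6 come from B alone.
>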
> Am I wrong?
> In any case, using the script I'm about to attach, you can trigger the 
> over-delivery case very easily:
> (0) disable hinted hand-off to avoid that interacting with the test
> (1) start three nodes
> (2) create ks 'test' with rf=3 and cf 'slicetest'
> (3) ./slicetest.py hostname_of_node_C insert # let it run for a few seconds, 
> then ctrl-c
> (4) stop node A
> (5) ./slicetest.py hostname_of_node_C insert # let it run for a few seconds, 
> then ctrl-c
> (6) start node A, wait for B and C to consider it up
> (7) ./slicetest.py hostname_of_node_A slice # make A the co-ordinator, 
> though it doesn't necessarily matter
> You can also pass 'delete' (random deletion of 50% of contents) or 
> 'deleterange' (delete all in [0.2,0.8]) to slicetest, but you don't trigger a 
> short read by doing that (see discussion above).
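>
> For reference, the kind of check the 'slice' pass boils down to could be 
> sketched roughly as below (hypothetical pycassa-style code with a made-up 
> key name; the attached slicetest.py is the authoritative version and may 
> well differ):
>
>     import pycassa
>     from pycassa.cassandra.ttypes import ConsistencyLevel
>
>     pool = pycassa.ConnectionPool('test', ['hostname_of_node_A:9160'])
>     cf = pycassa.ColumnFamily(pool, 'slicetest')
>
>     count = 100
>     cols = cf.get('somekey', column_count=count,
>                   read_consistency_level=ConsistencyLevel.QUORUM)
>     # Over-delivery: reconciliation can hand back more than `count` columns.
>     assert len(cols) <= count, "got %d columns for count=%d" % (len(cols), count)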

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
