[ https://issues.apache.org/jira/browse/CASSANDRA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2643:
----------------------------------------

    Attachment: short_read.sh

You are right, there is a problem here.

I'll just note that your example is not a good example for QUORUM, because the 
fact that only C "has [10,20] of column deletions" means this situation did not 
happen with QUORUM writes (the consistency guarantee for QUORUM requires both 
reads and writes to be done at QUORUM).

However, this still shows that there is a problem for ONE (writes) + ALL 
(reads). And it's not hard to come up with an example for QUORUM (reads and 
writes): just consider the case where you insert, say, 10 columns and then 
delete the first 3, each time with 1 node down, but never the same one 
(sketched below).
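
Here is a toy simulation of that scenario (plain Python, not Cassandra code; it 
assumes a replica's local slice returns the tombstones it walks over along with 
the first 'count' live columns, and that the resolver keeps the highest 
timestamp per column):

COUNT = 3

def replica(missed_delete):
    # Columns 1..10 inserted at ts=10 on every node, then columns 1, 2, 3
    # deleted at ts=20, each time with a different node down (the one that
    # "missed" that tombstone).
    cols = {c: ('value', 10) for c in range(1, 11)}
    for c in (1, 2, 3):
        if c != missed_delete:
            cols[c] = (None, 20)            # None = tombstone
    return cols

def local_slice(cols, count):
    # What one replica returns: stop after `count` live columns, but include
    # the tombstones encountered on the way.
    out, live = {}, 0
    for c in sorted(cols):
        value, ts = cols[c]
        out[c] = (value, ts)
        if value is not None:
            live += 1
            if live == count:
                break
    return out

def resolve(r1, r2):
    # Coordinator-side reconciliation: highest timestamp wins.
    merged = dict(r1)
    for c, (value, ts) in r2.items():
        if c not in merged or ts > merged[c][1]:
            merged[c] = (value, ts)
    return merged

# QUORUM read hitting A (which missed the delete of 1) and B (which missed
# the delete of 2), asking for the first COUNT columns of the row.
A, B = replica(1), replica(2)
merged = resolve(local_slice(A, COUNT), local_slice(B, COUNT))
live = [c for c, (v, _) in sorted(merged.items()) if v is not None]
print(live)   # [4, 5]: a "short" read, even though columns 6..10 exist on all nodes

The client asked for 3 columns, got 2, and will wrongly conclude it has reached 
the end of the row.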

To make this concrete, I'm attaching a script that produces this "short read" 
effect. Disclaimer: it uses https://github.com/pcmanus/ccm and requires the 
patch I've attached to CASSANDRA-2646 (to be able to do a bounded slice with 
the cli).

The simplest fix I see (which doesn't mean it is simple per se) would be to 
request more columns if we come up short after a resolve on the coordinator 
(sketched below). Yes, in theory it means we may have to do an unknown number 
of such re-requests, but in practice I strongly doubt this will be a problem. 
The problem has very little chance of happening in real life to start with 
(for QUORUM, my script is simple but implements something that has very, very 
little chance of actually happening in real life -- especially with HH, read 
repair and repair), and the chance that, when it does happen, we need more 
than 1 re-request is ridiculously small.
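
Something along these lines on the coordinator, continuing the toy model from 
the sketch above (a rough sketch with hypothetical helpers, not actual 
Cassandra code):

def bounded_slice(cols, start, count):
    # Local slice on one replica: first `count` live columns >= start, plus
    # the tombstones encountered on the way.
    out, live = {}, 0
    for c in sorted(cols):
        if c < start:
            continue
        value, ts = cols[c]
        out[c] = (value, ts)
        if value is not None:
            live += 1
            if live == count:
                break
    return out

def quorum_slice(replicas, count):
    result, start = [], 0
    while len(result) < count:
        remaining = count - len(result)
        merged = {}
        for cols in replicas:                # resolve: highest timestamp wins
            for c, (value, ts) in bounded_slice(cols, start, remaining).items():
                if c not in merged or ts > merged[c][1]:
                    merged[c] = (value, ts)
        if not merged:
            break                            # genuinely reached end-of-columns
        live = [c for c, (v, _) in sorted(merged.items()) if v is not None]
        result.extend(live[:remaining])
        start = max(merged) + 1              # if short, re-request past what we saw
    return result

# With the A/B layout from the first sketch:
#   quorum_slice([replica(1), replica(2)], 3)  ->  [4, 5, 6]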


> read repair/reconciliation breaks slice based iteration at QUORUM
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-2643
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2643
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.5
>            Reporter: Peter Schuller
>            Priority: Critical
>         Attachments: short_read.sh, slicetest.py
>
>
> In short, I believe iterating over columns is impossible to do reliably with 
> QUORUM due to the way reconciliation works.
> The problem is that the SliceQueryFilter is executing locally when reading on 
> a node, but no attempts seem to be made to consider limits when doing 
> reconciliation and/or read-repair (RowRepairResolver.resolveSuperset() and 
> ColumnFamily.resolve()).
> If a node slices and comes up with 100 columns, and another node slices and 
> comes up with 100 columns, some of which are unique to each side, 
> reconciliation results in > 100 columns in the result set. In this case the 
> effect is limited to "client gets more than asked for", but the columns still 
> accurately represent the range. This is easily triggered by my test-case.
> In addition to the client receiving "too many" columns, I believe some of 
> them will not be satisfying the QUORUM consistency level for the same reasons 
> as with deletions (see discussion below).
> Now, there *should* be a problem for tombstones as well, but it's more 
> subtle. Suppose A has:
>   1
>   2
>   3
>   4
>   5
>   6
> and B has:
>   1
>   del 2
>   del 3
>   del 4
>   5
>   6 
> If you now slice 1-6 with count=3 the tombstones from B will reconcile with 
> those from A - fine. So you end up getting 1,5,6 back. This made it a bit 
> difficult to trigger in a test case until I realized what was going on. At 
> first I was "hoping" to see a "short" iteration result, which would mean that 
> the process of iterating until you get a short result will cause spurious 
> "end of columns" and thus make it impossible to iterate correctly.
> So; due to 5-6 existing (and if they didn't, you legitimately reached 
> end-of-columns) we do indeed get a result of size 3 which contains 1,5 and 6. 
> However, only node B would have contributed columns 5 and 6; so there is 
> actually no QUORUM consistency on the co-ordinating node with respect to 
> these columns. If node A and C also had 5 and 6, they would not have been 
> considered.
> Am I wrong?
> In any case; using the script I'm about to attach, you can trigger the 
> over-delivery case very easily:
> (0) disable hinted hand-off to avoid that interacting with the test
> (1) start three nodes
> (2) create ks 'test' with rf=3 and cf 'slicetest'
> (3) ./slicetest.py hostname_of_node_C insert # let it run for a few seconds, 
> then ctrl-c
> (4) stop node A
> (5) ./slicetest.py hostname_of_node_C insert # let it run for a few seconds, 
> then ctrl-c
> (6) start node A, wait for B and C to consider it up
> (7) ./slicetest.py hostname_of_node_A slice # make A co-ordinator though it 
> doesn't necessarily matter
> You can also pass 'delete' (random deletion of 50% of contents) or 
> 'deleterange' (delete all in [0.2,0.8]) to slicetest, but you don't trigger a 
> short read by doing that (see discussion above).

