[ 
https://issues.apache.org/jira/browse/CASSANDRA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035030#comment-13035030
 ] 

Peter Schuller commented on CASSANDRA-2643:
-------------------------------------------

You're right of course - my example was bogus. I also agree that retrying is
reasonable under the circumstances, though perhaps not optimal.

Regarding the fix - let me just make sure I understand you correctly. Given a
read command with a limit N that yields fewer than N columns
(post-reconciliation), we may need to re-request from one or more nodes. But
how do we distinguish between a legitimate short read and a spurious short
read? The criterion, it seems to me, is that a read is potentially spuriously
short if "one or more of the nodes involved returned a NON-short read". If all
of them returned short reads, it's fine; only if we have results from a node
that we cannot prove actually exhausted its list of available columns do we
need to check.
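
To check my own reading of it, here is a minimal sketch of that criterion in
Python (all of the names are invented for illustration; this is not the
actual resolver code):

    def potentially_spurious_short_read(limit, merged_columns, responses):
        # responses: one list of returned columns per replica that answered;
        # merged_columns: the post-reconciliation result for the slice.
        if len(merged_columns) >= limit:
            return False  # the merged result is not short at all
        # A per-node read is "short" if it returned fewer columns than asked
        # for; such a node provably exhausted its columns for the range.
        if all(len(r) < limit for r in responses):
            return False  # every node hit end-of-columns: legitimately short
        # At least one node returned a full (non-short) read, so it may have
        # more columns beyond its last one, and reconciliation may have eaten
        # some of what it did return: we need to check (re-request).
        return True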

That is my understanding of your proposed solution, and that does seem doable 
on the co-ordinator side without protocol changes since we obviously know what 
we actually got from each node; it's just a matter of coding acrobatics (not 
sure how much work).

However, would you agree with this claim: this would fix the spurious short
read problem specifically, but it does not address the more general problem of
consistency - i.e., one might receive columns that have not gone through
reconciliation across a quorum of replicas?

If we are to solve that, while still avoiding protocol changes, I believe we
need to retry whenever a more general condition is true: that we do not have
confirmed QUORUM for the full range implied by the start+limit we were asked
for. In other words, retry if one or more of the nodes participating in the
read returned a response that satisfies:

  (1) The response was *not* short,
    AND
  (2) the response's "last" column sorts before the "last" column that we are
to return post-reconciliation.
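
Roughly, per response (a sketch only; columns are represented here simply by
their sorted names, and none of these helpers exist in the code):

    def response_forces_retry(response_columns, merged_last, limit):
        # (1) A short response proves the node exhausted its columns, so it
        #     cannot be hiding anything beyond what it returned.
        if len(response_columns) < limit:
            return False
        # (2) The node's coverage stops before the last column we are about
        #     to return, so everything past its last column lacks its vote.
        return response_columns[-1] < merged_last

    # given the per-node column lists and the merged result's last column:
    must_retry = any(response_forces_retry(cols, merged_last, limit)
                     for cols in responses)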

Lacking a protocol change to communicate the authoritative range of a
response, and given the premise that we *must* deliver start+limit unless
fewer than limit columns are available, we can only consider the full span of
a response (its first to its last column) authoritative - except in the case
of a short read, in which case it is authoritative to infinity.
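
Sketched the same way, taking the slice start as the lower bound since a
response is contiguous from there (again, invented names only):

    INFINITY = object()  # stand-in for "no upper bound"

    def inferred_authoritative_range(slice_start, limit, response_columns):
        # A short read proves the node has nothing at all past what it
        # returned, so it is authoritative for the whole remaining range.
        if len(response_columns) < limit:
            return (slice_start, INFINITY)
        # Otherwise we can only trust it from the slice start up to the last
        # column it actually returned.
        return (slice_start, response_columns[-1])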

Without revisiting the code to figure out the easiest way to implement this,
one thought: if you agree that a clean long-term fix would be to communicate
authoritativeness in responses, perhaps the logic to handle this can at least
be made compatible with that way of thinking. Until protocol changes can
happen, we would (1) infer authoritativeness from the columns/tombstones in
the result rather than from explicit indicators in the response, and (2)
since we cannot propagate short ranges to clients, re-request instead of
cleanly returning a short-but-not-EOF-indicating range to the client.
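
Something shaped like this, purely illustrative and not a proposal for the
actual message format:

    class SliceResponseSketch:
        # What a future response might carry explicitly; until then the
        # coordinator has to infer it from the columns themselves.
        def __init__(self, columns, authoritative_to=None):
            self.columns = columns
            # Upper bound of the range this node vouches for; None means
            # "infer it": the last returned column, or infinity for a short
            # read.
            self.authoritative_to = authoritative_to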

Thoughts?

> read repair/reconciliation breaks slice based iteration at QUORUM
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-2643
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2643
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.5
>            Reporter: Peter Schuller
>            Priority: Critical
>         Attachments: short_read.sh, slicetest.py
>
>
> In short, I believe iterating over columns is impossible to do reliably with 
> QUORUM due to the way reconciliation works.
> The problem is that the SliceQueryFilter is executing locally when reading on 
> a node, but no attempts seem to be made to consider limits when doing 
> reconciliation and/or read-repair (RowRepairResolver.resolveSuperset() and 
> ColumnFamily.resolve()).
> If a node slices and comes up with 100 columns, and another node slices and 
> comes up with 100 columns, some of which are unique to each side, 
> reconciliation results in > 100 columns in the result set. In this case the 
> effect is limited to "client gets more than asked for", but the columns still 
> accurately represent the range. This is easily triggered by my test-case.
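>
> A toy illustration of the over-delivery (plain Python as a model only, not 
> the actual resolver code):
>
>   a = ["c%03d" % i for i in range(100)]      # node A's 100 columns
>   b = ["c%03d" % i for i in range(50, 150)]  # node B's 100, 50 unique
>   merged = sorted(set(a) | set(b))  # reconciliation keeps the union
>   assert len(merged) == 150         # asked for 100, client gets 150
>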
> In addition to the client receiving "too many" columns, I believe some of 
> them will not satisfy the QUORUM consistency level, for the same reasons as 
> with deletions (see discussion below).
> Now, there *should* be a problem for tombstones as well, but it's more 
> subtle. Suppose A has:
>   1
>   2
>   3
>   4
>   5
>   6
> and B has:
>   1
>   del 2
>   del 3
>   del 4
>   5
>   6 
> If you now slice 1-6 with count=3, the tombstones from B reconcile with the 
> corresponding columns from A - fine. So you end up getting 1, 5 and 6 back. 
> This made it a bit difficult to trigger in a test case until I realized what 
> was going on. At first I was "hoping" to see a "short" iteration result, 
> which would mean that the process of iterating until you get a short result 
> would cause spurious "end of columns" and thus make it impossible to iterate 
> correctly.
> So, because 5 and 6 exist (if they did not, you would legitimately have 
> reached end-of-columns), we do indeed get a result of size 3 which contains 
> 1, 5 and 6. However, only node B contributed columns 5 and 6, so there is 
> actually no QUORUM consistency on the co-ordinating node with respect to 
> these columns. Even if nodes A and C also had 5 and 6, their copies would 
> not have been considered.
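>
> To make that concrete, a rough model of the merge (plain Python, timestamps 
> omitted and the tombstones assumed newer; not the real code path):
>
>   a_slice = [(1, "live"), (2, "live"), (3, "live")]   # A stopped at count=3
>   b_slice = [(1, "live"), (2, "del"), (3, "del"), (4, "del"),
>              (5, "live"), (6, "live")]
>   merged = dict(a_slice)
>   for name, state in b_slice:
>       if state == "del":
>           merged[name] = "del"     # the newer tombstone shadows A's column
>       else:
>           merged.setdefault(name, state)
>   live = [n for n in sorted(merged) if merged[n] == "live"][:3]
>   assert live == [1, 5, 6]   # size 3, so it does not look like a short read
>   # ...yet 5 and 6 were only ever read from node B's response
>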
> Am I wrong?
> In any case, using the script I'm about to attach, you can trigger the 
> over-delivery case very easily:
> (0) disable hinted hand-off to keep it from interacting with the test
> (1) start three nodes
> (2) create ks 'test' with rf=3 and cf 'slicetest'
> (3) ./slicetest.py hostname_of_node_C insert # let it run for a few seconds, 
> then ctrl-c
> (4) stop node A
> (5) ./slicetest.py hostname_of_node_C insert # let it run for a few seconds, 
> then ctrl-c
> (6) start node A, wait for B and C to consider it up
> (7) ./slicetest.py hostname_of_node_A slice # make A co-ordinator though it 
> doesn't necessarily matter
> You can also pass 'delete' (random deletion of 50% of contents) or 
> 'deleterange' (delete all in [0.2,0.8]) to slicetest, but you don't trigger a 
> short read by doing that (see discussion above).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
