[ 
https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900659#action_12900659
 ] 

Jonathan Ellis edited comment on CASSANDRA-1072 at 8/23/10 10:37 AM:
---------------------------------------------------------------------

I actually think the main problem with the 1072 approach is on the write side, 
not the read.  Writes are fragile.  Here is what I mean by that:

Because 1072 "shards" the increments across multiple machines, it can tolerate 
_temporary_ failures.  This is good.  But because the shards are no longer 
replicas in the normal Cassandra sense -- each is responsible for a different 
set of increments, rather than all maintaining the same data -- there is no way 
to create a write CL greater than ONE, and thus, no defense against _permanent_ 
failures of single machines.  That is, if a single machine dies at the right 
time, you will lose data, and unlike normal cassandra you can't prevent that by 
requesting that writes not be acked until a higher CL is achieved.  (You can 
try to band-aid the problem with Sylvain's repair-on-write, but only reduces 
the failure window, it does not eliminate it.)

A related source of fragility is that operations here are not idempotent from 
the client's perspective.  This is why repair-on-write can't be used to 
simulate higher CLs -- if a client issues an increment and it comes back 
TimedOut (or a socket exception to the coordinator), it has no safe way to 
retry: if the operation went on to succeed later, but the client responded to 
the TimedOut by issuing another increment, we have added new data erroneously; 
but if we do not re-issue the operation, and the original operation failed 
permanently, we have lost data.  Thus, even temporary node failures can cause 
data corruption.  (I say corruption, because I mean to distinguish it from the 
sort of inconsistency that RR and AES can repair.  Once introduced into the 
system, this corruption is not distinguishable from real increments and thus 
un-repairable.)

      was (Author: jbellis):
    I actually think the main problem with the 1072 approach is on the write 
side, not the read.  Writes are fragile.  Here is what I mean by that:

Because 1072 "shards" the increments across multiple machines, it can tolerate 
_temporary_ failures.  This is good.  But because the shards are no longer 
replicas in the normal Cassandra sense -- each is responsible for a different 
set of increments, rather than all maintaining the same data -- there is no way 
to create a write CL greater than ONE, and thus, no defense against _permanent_ 
failures of single machines.  That is, if a single machine dies at the right 
time, you will lose data, and unlike normal cassandra you can't prevent that by 
requesting that writes not be acked until a higher CL is achieved.  (You can 
try to band-aid the problem with Sylvain's repair-on-write, but only reduces 
the failure window, it does not eliminate it.)

A related source of fragility is that operations here are not idempotent from 
the client's perspective.  This is why repair-on-write can't be used to 
simulate higher CLs -- if a client issues an increment and it comes back 
TimedOut (or a socket exception to the coordinator), it has no safe way to 
retry; if the operation did succeed, and the client issues another delete, we 
have added new data erroneously; if the operation did not succeed, and we do 
not retry, we have lost data.  So here is a case where even temporary node 
failures can cause data corruption.  (I say corruption, because I mean to 
distinguish it from the sort of inconsistency that RR and AES can repair.  Once 
introduced into the system, this corruption is not distinguishable from real 
increments and thus un-repairable.)
  
> Increment counters
> ------------------
>
>                 Key: CASSANDRA-1072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1072
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Johan Oskarsson
>            Assignee: Kelvin Kakugawa
>         Attachments: CASSANDRA-1072-2.patch, CASSANDRA-1072-2.patch, 
> CASSANDRA-1072.patch, CASSANDRA-1072.patch, Incrementcountersdesigndoc.pdf
>
>
> Break out the increment counters out of CASSANDRA-580. Classes are shared 
> between the two features but without the plain version vector code the 
> changeset becomes smaller and more manageable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to