[ https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900659#action_12900659 ]
Jonathan Ellis edited comment on CASSANDRA-1072 at 8/20/10 6:17 AM:
--------------------------------------------------------------------

I actually think the main problem with the 1072 approach is on the write side, not the read. Writes are fragile. Here is what I mean by that:

Because 1072 "shards" the increments across multiple machines, it can tolerate _temporary_ failures. This is good. But because the shards are no longer replicas in the normal Cassandra sense -- each is responsible for a different set of increments, rather than all maintaining the same data -- there is no way to create a write CL greater than ONE, and thus no defense against _permanent_ failures of single machines. That is, if a single machine dies at the right time, you will lose data, and unlike normal Cassandra you can't prevent that by requesting that writes not be acked until a higher CL is achieved. (You can try to band-aid the problem with Sylvain's repair-on-write, but that only reduces the failure window; it does not eliminate it.)

A related source of fragility is that operations here are not idempotent from the client's perspective. This is why repair-on-write can't be used to simulate higher CLs: if a client issues an increment and it comes back TimedOut (or a socket exception to the coordinator), it has no safe way to retry. If the operation did succeed and the client issues the increment again, we have added new data erroneously; if the operation did not succeed and we do not retry, we have lost data. So here is a case where even temporary node failures can cause data corruption. (I say corruption because I mean to distinguish it from the sort of inconsistency that RR and AES can repair. Once introduced into the system, this corruption is not distinguishable from real increments and is thus un-repairable.)
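To make the retry dilemma concrete, here is a minimal sketch in plain Java. The increment/put methods and the simulated lost ack are made up for illustration -- this is not the 1072 patch or the real Thrift client API. It only shows why retrying a timed-out increment can double-apply the delta, while retrying a timed-out overwrite of an absolute value converges to the same state either way, which is what makes ordinary writes safely retryable and counter increments not.

{code:java}
// Hypothetical sketch -- not the 1072 patch or the real client interface.
// Illustrates why a timed-out increment cannot be retried safely,
// while a timed-out overwrite (idempotent write) can.

import java.util.Random;
import java.util.concurrent.TimeoutException;

public class RetryDilemma
{
    static final Random rng = new Random(42);

    static long counter = 0;   // server-side counter shard
    static long lastValue = 0; // server-side value of an ordinary column

    // Non-idempotent: applies the delta, but the ack may be lost in flight.
    static void increment(long delta) throws TimeoutException
    {
        counter += delta;                 // write is applied...
        if (rng.nextBoolean())
            throw new TimeoutException(); // ...but the client never learns it
    }

    // Idempotent: overwrites with an absolute value; replaying it is a no-op.
    static void put(long value) throws TimeoutException
    {
        lastValue = value;
        if (rng.nextBoolean())
            throw new TimeoutException();
    }

    public static void main(String[] args)
    {
        // Client retries the increment after a timeout: the delta may land twice.
        try
        {
            increment(1);
        }
        catch (TimeoutException e)
        {
            try { increment(1); } catch (TimeoutException ignored) {}
        }
        System.out.println("counter after 1 logical increment + retry = " + counter); // may print 2

        // Client retries the overwrite after a timeout: the result is 7 either way.
        try
        {
            put(7);
        }
        catch (TimeoutException e)
        {
            try { put(7); } catch (TimeoutException ignored) {}
        }
        System.out.println("value after overwrite + retry = " + lastValue); // always 7
    }
}
{code}

The simulated failure is not the point; the point is that the client cannot tell which branch it is in, and with the increment the two branches lead to different, equally plausible counter values.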
> Increment counters
> ------------------
>
>                 Key: CASSANDRA-1072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1072
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Johan Oskarsson
>            Assignee: Kelvin Kakugawa
>         Attachments: CASSANDRA-1072-2.patch, CASSANDRA-1072-2.patch, CASSANDRA-1072.patch, Incrementcountersdesigndoc.pdf
>
>
> Break the increment counters out of CASSANDRA-580. Classes are shared between the two features, but without the plain version vector code the changeset becomes smaller and more manageable.