[ https://issues.apache.org/jira/browse/CASSANDRA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406777#comment-13406777 ]
Jonathan Ellis commented on CASSANDRA-4285:
-------------------------------------------

We can break up implementation as follows:

# Add an atomic_batch_mutate method (with the same parameters as batch_mutate) using batchlog/SimpleStrategy in CassandraServer + StorageProxy
# Implement batchlog replay
# Implement a custom BatchlogStrategy that ensures redundancy at RF=1
# Add CQL3 support

3 and 4 appear straightforward. 1 and 2 have some hairier corners:

For the batchlog write:
- We don't want to make the write path more fragile (in the sense that atomic writes will fail where non-atomic ones would succeed). But the batchlog shard will probably be on a different machine than the replicas. If that machine is down, we could raise UnavailableException... but better would be to try different shards until we find one whose owner is up.
- Part of the goal here is to avoid forcing the client to retry on TimedOutException. So if we attempt a batchlog write that times out, we should also retry on another shard instead of propagating TOE to the client.
- Corollary: we don't need to worry about batchlog hints.
- Finally, once the batchlog write succeeds, we shouldn't have to make the client retry for timeouts writing to the replicas either; we can do the retry server-side. But we can't just return success, since that would imply that we'd achieved the requested ConsistencyLevel and the data is available to be read. Instead, we should introduce a new exception (InProgressException?) to indicate that the data isn't available to read yet, but the client does not need to retry. (We could use this exception as well for normal writes, where we have at least one replica acknowledge the update in time.)
- What about RF and CL for batchlog? If it's convenient, we can allow users to customize batchlog RF, but we should always use CL=1 for reads and writes. (If we need to go lower-level, though, instead of re-using the normal write path, I'm fine with hardcoding RF=1.)
We don't care about "latest versions," since we're append-only and if we don't see an entry on one replay attempt we'll see it on the next, and we really don't need more durability than one replica since it's only a "staging area" until it's sent out to the replicas. The main alternative would be to use the CL for the batch in the batchlog write. I don't like that, though, because it's going to introduce extra latency for the batchlog write that you don't want 99% of the time.

For replay:
- The main difficulty is that the batchlog shard owners can't be assumed to be alive when we restart. So we'll need to track replay status for each shard: check on startup and retry periodically until we're successful. (One advantage of restricting batchlog to RF=1 is that when a read succeeds, we know we're done replaying. But if we have RF>1, then we need to retry the read indefinitely in case another replica recovered that had additional entries.)

> Atomic, eventually-consistent batches
> -------------------------------------
>
>                 Key: CASSANDRA-4285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4285
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>
> I discussed this in the context of triggers (CASSANDRA-1311) but it's useful
> as a standalone feature as well.
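
For illustration only, the batchlog write path proposed in the comment above (try batchlog shards until one accepts the batch, then write to the replicas, and report "in progress" rather than a timeout if the replicas are slow) might be sketched roughly as follows. This is not Cassandra code: the function signature, the callback parameters, the Outcome enum, and the ShardDown and WriteTimeout exceptions are all hypothetical names invented for the sketch, and it is written in Python for brevity rather than against CassandraServer or StorageProxy.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"          # replicas acknowledged at the requested CL
    IN_PROGRESS = "in_progress"  # batch is in the batchlog; client need not retry
    UNAVAILABLE = "unavailable"  # no batchlog shard accepted the batch


class ShardDown(Exception):
    """Hypothetical: the owner of a batchlog shard is down."""


class WriteTimeout(Exception):
    """Hypothetical stand-in for TimedOutException."""


def atomic_batch_mutate(batch, shards, write_batchlog, write_replicas):
    """Sketch of the proposed flow; every name here is an assumption.

    write_batchlog(shard, batch) and write_replicas(batch) are caller-supplied
    callables that raise ShardDown or WriteTimeout on failure.
    """
    stored = False
    for shard in shards:
        try:
            write_batchlog(shard, batch)  # staging write, CL=1 in the proposal
            stored = True
            break
        except (ShardDown, WriteTimeout):
            continue  # try the next shard instead of failing the client

    if not stored:
        return Outcome.UNAVAILABLE

    try:
        write_replicas(batch)
    except WriteTimeout:
        # The batch is already staged, so replay can finish it server-side.
        return Outcome.IN_PROGRESS

    # Cleaning up the staged entry after success is outside this sketch.
    return Outcome.SUCCESS
```

The sketch returns an enum value where the comment proposes raising a new exception (InProgressException?); either way, the point is that a replica timeout after a successful batchlog write is reported differently from a plain failure.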