[ https://issues.apache.org/jira/browse/CASSANDRA-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sylvain Lebresne updated CASSANDRA-580:
---------------------------------------

    Attachment: 0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch

I believe the attached patch (580-version-vector-wip.patch) has a problem. At CL.ZERO and CL.ONE, it doesn't replicate the writes that use version vectors at all: SP.updateDestinationByClock() clears destinationEndpoints but still returns an empty collection. This is (overly) unsafe.

This could certainly be fixed by adding a new WriteResponseHandler for those cases, but I believe there is a *much* better alternative. It consists in changing the write protocol (for version vectors only, of course) to do the following (note that the protocol of the current patch already differs from the one for timestamps):

# A node receives a write request (with a version vector clock) from a client. If it is a replica for the write, go to 3); otherwise go to 2).
# The node delegates the write to one replica (along with the requested CL) and then only waits for an ack from this replica before answering the client (it doesn't replicate anything itself).
# The chosen replica applies the mutation locally first (this must happen before replication).
# It then sends the mutation to the other replicas and waits for as many responses as are needed to achieve the requested consistency.

To make this more concrete, I'm attaching a patch (0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch) that implements this protocol (it all starts in SP.delegateMutateBlocking()).

Small disclaimers: this should work but is not really tested (so please be nice :)). The function RowMutation.updateBeforeReplication() can safely be ignored on a first read, but it would be needed if #1072 were to use this. The patch could also probably be slightly optimized by allowing the DelegatedRowMutationVerbHandler to handle multiple mutations at once. Finally, this is only the protocol mentioned above; #580 would have to be rebased on top of it.
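For readers who want the four-step protocol above in a more tangible form, here is a minimal, synchronous sketch. It is not taken from the patch: the class DelegatedWriteSketch, the Node type and the coordinate()/replicateFrom() methods are hypothetical names invented for illustration, the "first replica" choice is arbitrary, and real acks would be asynchronous messages rather than return values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DelegatedWriteSketch {

    /** Hypothetical stand-in for a node; not a real Cassandra class. */
    public static final class Node {
        public final String name;
        public final List<String> applied = new ArrayList<>();

        public Node(String name) { this.name = name; }

        /** Applies the mutation locally; the return value plays the role of the ack. */
        public boolean apply(String mutation) {
            applied.add(mutation);
            return true;
        }
    }

    /** Steps 1 and 2: runs on the node that received the client request. */
    public static int coordinate(Node receiver, List<Node> replicas,
                                 String mutation, int requiredAcks) {
        // Step 1: a replica handles the write itself.
        // Step 2: a non-replica delegates to one replica (here simply the first,
        // an arbitrary choice) and replicates nothing on its own.
        Node leader = replicas.contains(receiver) ? receiver : replicas.get(0);
        return replicateFrom(leader, replicas, mutation, requiredAcks);
    }

    /** Steps 3 and 4: runs on the chosen replica. */
    public static int replicateFrom(Node leader, List<Node> replicas,
                                    String mutation, int requiredAcks) {
        // Step 3: apply locally before any replication.
        int acks = leader.apply(mutation) ? 1 : 0;
        // Step 4: send to every other replica, whatever the consistency level,
        // so that even requiredAcks == 0 still replicates the write.
        for (Node other : replicas) {
            if (other != leader && other.apply(mutation)) {
                acks++;
            }
        }
        if (acks < requiredAcks) {
            throw new IllegalStateException(
                "got " + acks + " acks, needed " + requiredAcks);
        }
        return acks;
    }

    public static void main(String[] args) {
        Node r1 = new Node("r1"), r2 = new Node("r2"), r3 = new Node("r3");
        Node nonReplica = new Node("n0");
        List<Node> replicas = Arrays.asList(r1, r2, r3);
        int acks = coordinate(nonReplica, replicas, "m1", 1);
        System.out.println("acks=" + acks + ", n0 applied=" + nonReplica.applied);
    }
}
```

The point of the sketch is that the consistency level only changes how many acks are awaited, never whether the mutation is sent to the other replicas, which is what goes wrong at CL.ZERO and CL.ONE in the current patch.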
Anyway, I think this alternative is superior to the one used by the currently attached #580 patch for the following reasons:

* The protocol used by the current patch (write to one replica, wait for the ack and then replicate to the others, which differs from what I propose in that it is driven from a potentially non-replica node) doesn't work for #1072, because of a potential race condition with read repairs. The protocol I'm proposing does not suffer from this problem and (I'm quite convinced, let's hope I'm not wrong) would reconcile #1072 with the EC model of Cassandra. This is obviously the most important point.
* It is slightly faster (network-latency-wise), as we don't wait for a full round-trip to a node before starting the replication.
* It more cleanly separates the protocols for timestamped writes and versioned ones (without much code duplication, really). Whether this is better is probably a matter of opinion, but at the very least it makes it clearer that version vectors neither slow down nor break the other writes.

I'd be happy if someone had a look at this and confirmed that I'm not completely wide of the mark. If I'm not, I may be able to spare some cycles merging this idea with #580 (and #1072).

> vector clock support
> --------------------
>
>                 Key: CASSANDRA-580
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-580
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>         Environment: N/A
>            Reporter: Kelvin Kakugawa
>            Assignee: Kelvin Kakugawa
>             Fix For: 0.7.0
>
>         Attachments: 0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch,
>                      580-1-Add-ColumnType-as-enum.patch, 580-context-v4.patch,
>                      580-counts-wip1.patch, 580-thrift-v3.patch, 580-thrift-v6.patch,
>                      580-version-vector-wip.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Allow a ColumnFamily to be versioned via vector clocks, instead of long timestamps.
> Purpose: enable incr/decr; flexible conflict resolution.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.