[ https://issues.apache.org/jira/browse/CASSANDRA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407105#comment-13407105 ]

Sylvain Lebresne commented on CASSANDRA-4285:
---------------------------------------------

If I understand that correctly, only the coordinator of a given batch would be 
able to replay its batches. The problem I can see with that is that if the node 
dies and you never "replace it" (i.e. bring a node with the same IP back up), 
then you might never replay some batches, which puts a strong burden on the 
operator not to screw up. Besides, the batches won't be replayed until a 
replacement node is brought up, which means that even if we ultimately replay 
them, it can take an unbounded amount of time.

So I would also add a mechanism to allow other nodes to replay batches. For 
instance, when a node A detects that another node B is down, it could check 
whether it has some batches for B locally and replay them (node B will replay 
them too when it's back up, but that doesn't matter).
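
To sketch what I have in mind (all names below are invented, this is just to 
illustrate the idea, not actual code): node A hooks into the failure detector 
and replays whatever it has logged on behalf of B.

{code:java}
// Illustrative sketch only -- class and method names are invented.
// When the failure detector convicts a node, check whether we hold batchlog
// entries written on that node's behalf and replay them ourselves.
public void onNodeDown(InetAddress downNode)
{
    for (BatchlogEntry entry : localBatchlog.entriesCoordinatedBy(downNode))
    {
        for (RowMutation mutation : entry.mutations())
            sendToNaturalReplicas(mutation);      // hypothetical helper
        localBatchlog.remove(entry.id());         // idempotent: B replaying it again later is harmless
    }
}
{code}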

bq. we need to retry the read indefinitely in case another replica recovered

For that too we can use the failure detector to track which nodes we've 
successfully checked since restart (avoids the "indefinitely" part).
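
Roughly (again just a sketch, the helper names are invented):

{code:java}
// Sketch only -- invented names. Instead of retrying the batchlog read forever,
// remember which batchlog replicas we have successfully checked since restart;
// once they have all been checked, we can stop.
private final Set<InetAddress> checkedSinceRestart = new HashSet<InetAddress>();

public boolean replayCheckComplete(Collection<InetAddress> batchlogReplicas)
{
    for (InetAddress replica : batchlogReplicas)
        if (FailureDetector.instance.isAlive(replica) && tryReadBatchlog(replica))  // tryReadBatchlog is hypothetical
            checkedSinceRestart.add(replica);
    return checkedSinceRestart.containsAll(batchlogReplicas);
}
{code}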

bq. default RF will be 1; operators can increase if desired

I'll admit I find 1 just a bit too low for a default (especially given it'll be 
global) and I would prefer at least 2. My reasoning is that:
# RF=1 is a tad unsafe as far as durability is concerned.
# RF=1 has the problem that the one replica you've picked might time out. Even 
if we automatically retry another shard (which I'm not in favor of, see below), 
that will screw up the latency. RF > 1 (with CL.ONE) largely mitigates that 
issue (see the sketch after this list).
# A higher RF won't be slower during the writes (it will actually be faster 
because of my preceding point), and that is really what we care about. If 
replay is a bit slower because of it, it's not a big deal (especially given 
that there will never be much to replay).
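
To be concrete about point 2 (sketch only, the helpers are invented): with RF=2 
we can send the batchlog write to two nodes but only wait for the first ack, so 
one slow or dead node doesn't turn into a client-visible timeout.

{code:java}
// Sketch only -- helper names are invented. Write the batchlog entry to two
// endpoints but block only until the first acknowledgement (effectively CL.ONE),
// so a single slow/dead batchlog replica doesn't hurt write latency.
List<InetAddress> endpoints = pickBatchlogEndpoints(2);        // hypothetical helper
CountDownLatch firstAck = new CountDownLatch(1);               // released by the first ack callback
for (InetAddress endpoint : endpoints)
    sendBatchlogMutationAsync(batchEntry, endpoint, firstAck); // hypothetical async send
firstAck.await(writeTimeoutMillis, TimeUnit.MILLISECONDS);     // returns as soon as one endpoint has acked
{code}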

bq. Part of the goal here is to avoid forcing the client to retry on 
TimedOutException. So if we attempt a batchlog write that times out, we should 
also retry to another shard instead of propagating TOE to the client.

I think that what this ticket will provide is an extension of the atomicity 
that exists for batches to the same key to all batches, and I don't think this 
gives us much more than that. So I fully expect the retry policy for clients to 
be unchanged (most of the time client applications want to retry because what 
they care about is to achieve a given consistency level, or because they care 
that the data is replicated to at least X nodes).

In other words, I see a timeout as saying "I haven't been able to achieve the 
requested consistency level in time". This ticket doesn't change that, it only 
makes stronger guarantees on the state of the DB in that case (which is good). 
But I don't see why that would make us start doing retries server-side.

bq. we shouldn't have to make the client retry for timeouts writing to the 
replicas either; we can do the retry server-side

Same as above, I disagree :).

bq. Instead, we should introduce a new exception (InProgressException?) to 
indicate that the data isn't available to read yet

As said above, I think that this should still be a TimeoutException. However, I 
do see a point in giving more info on what that timeout means, and I've opened 
CASSANDRA-4414 for that (which I'd been meaning to do for some time anyway). 
Having successfully written to the DCL could simply be one of the pieces of 
info we add to the TimeoutException.
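
Purely as an illustration (field names invented, not a proposal for the actual 
thrift definition), the timeout could then carry something like:

{code:java}
// Illustration only -- invented fields. The point is that the client learns
// whether the batch was durably written to the batchlog (and will therefore be
// applied eventually) even though the requested CL wasn't achieved in time.
public class TimedOutException extends Exception
{
    public final int acknowledgedBy;          // replicas that acked before the timeout
    public final boolean writtenToBatchlog;   // true => the batch will eventually be replayed

    public TimedOutException(int acknowledgedBy, boolean writtenToBatchlog)
    {
        this.acknowledgedBy = acknowledgedBy;
        this.writtenToBatchlog = writtenToBatchlog;
    }
}
{code}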

                
> Atomic, eventually-consistent batches
> -------------------------------------
>
>                 Key: CASSANDRA-4285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4285
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>
> I discussed this in the context of triggers (CASSANDRA-1311) but it's useful 
> as a standalone feature as well.
