[ https://issues.apache.org/jira/browse/CASSANDRA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407628#comment-13407628 ]

Jonathan Ellis commented on CASSANDRA-4285:
-------------------------------------------

bq. only the coordinator of a given batch might be able to replay batches

Right, my (unspecified so far) assumption was that we would provide a way to 
assume responsibility for a removed coordinator's orphaned entries on 
removetoken/decommission/replacetoken.  

bq. when a node A detects that another node B is down, it could check whether 
it has some batches for B locally and replay them 

That makes sense; that would be a lot more timely than waiting to replace B.  
If we have B delete these entries when it's done, we might not even need to 
worry about removetoken et al.
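
To make that concrete, here is a minimal sketch of the replay-on-down hook.  
BatchlogStore, ReplicaWriter, and BatchEntry are illustrative names I am 
making up for this comment, not actual classes; the real thing would hang off 
the failure detector:

{code:java}
// Hypothetical sketch of node A replaying batches it holds for a downed node B.
import java.util.List;
import java.util.UUID;

public class BatchlogReplayOnDown
{
    interface BatchlogStore
    {
        /** Entries we hold locally on behalf of the given coordinator. */
        List<BatchEntry> entriesFor(UUID coordinator);
        void delete(UUID batchId);
    }

    interface ReplicaWriter
    {
        /** Re-sends the batched mutations through the normal write path. */
        boolean replay(BatchEntry entry);
    }

    static class BatchEntry
    {
        final UUID batchId;
        final byte[] serializedMutations;

        BatchEntry(UUID batchId, byte[] serializedMutations)
        {
            this.batchId = batchId;
            this.serializedMutations = serializedMutations;
        }
    }

    private final BatchlogStore store;
    private final ReplicaWriter writer;

    BatchlogReplayOnDown(BatchlogStore store, ReplicaWriter writer)
    {
        this.store = store;
        this.writer = writer;
    }

    /** Invoked when the failure detector marks a coordinator down. */
    public void onNodeDown(UUID downedCoordinator)
    {
        for (BatchEntry entry : store.entriesFor(downedCoordinator))
        {
            // Mutations carry their original timestamps, so replay is
            // idempotent and cannot clobber newer writes.
            if (writer.replay(entry))
                store.delete(entry.batchId);
            // On failure, keep the entry and retry on the next down event.
        }
    }
}
{code}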

bq. I find 1 just a bit too low for a default

Well, it's more complex than that.  If we used the BacklogStrategy proposed 
above, it's really "RF=1+": to lose data we need both (1) the coordinator to 
go down before it can replicate out to the actual data replicas, and (2) the 
backlog shard host to die unrecoverably.  So the window in which hardware 
failure can cause data loss is much narrower than in a traditional RF=1 case, 
where losing that node at any time from now on loses data.  That is why the 
traditional case needs at least RF=2: extra replicas are its only source of 
redundancy.  But in the backlog case both the coordinator and the client 
provide redundancy across a small window of vulnerability.
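
To put the window argument in rough symbols (back-of-envelope only; p, W, and 
T here are illustrative, not measured):

\[
P_{\mathrm{loss}}^{\mathrm{backlog}} \approx p(W)^{2}
\qquad \text{vs.} \qquad
P_{\mathrm{loss}}^{\mathrm{RF=1}} \approx p(T),
\qquad W \ll T
\]

where p(X) is the chance a given node dies unrecoverably during an interval X, 
W is the seconds-long window between accepting a batch and the replicas 
acknowledging it, and T is the entire lifetime of the data.  Two coincident 
failures inside a tiny W are far rarer than one failure anywhere in T.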

That said, I think if we just use the normal StorageProxy read path for replay 
purposes, we get arbitrary RF support automatically, so I don't think using 1 
as a default and allowing it to be tuned as desired will be a problem.

bq. I fully expect the retry policy for clients to be unchanged

I think this is (a) an important improvement, addressing (b) a significant 
pain point for people who have looked closely enough to realize what the 
"official" policy is today, and (c) one that we can fix without much 
difficulty in the context of atomic batches.  We use the local commitlog for 
both atomicity and durability; we can use the distributed batchlog in the same 
way.  (I note in passing that Megastore uses Bigtable rows as a distributed 
transaction log in a similar fashion.)
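
As a sketch of that analogy (again with made-up names; the real entry points 
would live in or near StorageProxy), the coordinator persists the whole batch 
remotely before applying any of it, exactly as a single node appends a 
mutation to its commitlog before touching the memtables:

{code:java}
// Hypothetical sketch: the distributed batchlog as a remote write-ahead log.
import java.util.UUID;

public class AtomicBatchWrite
{
    interface BatchlogClient
    {
        /** Durably stores the serialized batch on a peer before we apply it. */
        void store(UUID batchId, byte[] serializedBatch);
        /** Removes the entry once every mutation has been acknowledged. */
        void remove(UUID batchId);
    }

    interface MutationApplier
    {
        /** Fans each mutation out to its replicas via the normal write path. */
        void applyAll(byte[] serializedBatch);
    }

    private final BatchlogClient batchlog;
    private final MutationApplier applier;

    AtomicBatchWrite(BatchlogClient batchlog, MutationApplier applier)
    {
        this.batchlog = batchlog;
        this.applier = applier;
    }

    public void execute(byte[] serializedBatch)
    {
        UUID batchId = UUID.randomUUID();
        // 1. Write-ahead: like the local commitlog append, but on a peer, so
        //    the batch survives this coordinator dying mid-flight.
        batchlog.store(batchId, serializedBatch);
        // 2. Apply: send the mutations to their actual replicas.
        applier.applyAll(serializedBatch);
        // 3. Fully acknowledged: the write-ahead entry is no longer needed.
        batchlog.remove(batchId);
    }
}
{code}

If the coordinator dies between steps 1 and 3, whichever node holds the entry 
replays it -- the same recovery the commitlog gives a single node on restart.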

"Failed writes leave the cluster in an unknown state" is the most frequent 
[legitimate] complaint users have about Cassandra, and one that affects 
evaluations against master-oriented systems.  We can try to educate about the 
difference between UnavailableException (UE) failure and TimedOutException 
(TOE) not-really-failure until we are blue in the face, but we will continue 
to get hammered for it.

The standard answer of "just retry the operation" isn't really adequate, 
either.  If part of a batch times out and the client then dies before a retry 
succeeds, we will have no way to recover from the inconsistency (in the ACID 
sense, not the CAP sense).

Hence, Hector has implemented a client-side commitlog, which helps, but this 
is neither something every client should have to reimplement, nor as durable 
as what we can provide on the server (since it's always effectively RF=1); nor 
are client machines as robust as the ones in the Cassandra cluster.

Now, we cannot eliminate TOE 100% of the time, but we can come very, very 
close -- with the approach I outlined, the only case in which we need to hand 
back a TOE is when the coordinator attempts a backlog write to a 
believed-to-be-up node, the write fails, and the coordinator is then 
partitioned off so that no other live backlog targets are available.  Since we 
cannot continue, and we don't know whether the original attempt succeeded, we 
have to return TOE.  So we will (1) dramatically reduce how often TOE occurs, 
and (2) guarantee that the TOE we do hand back will not cause inconsistency if 
the client dies before it can retry.
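
Sketching that decision logic (illustrative names again): we try live backlog 
targets in turn, and only surrender with a TOE once an attempt with an unknown 
outcome has been made and no live targets remain:

{code:java}
// Hypothetical sketch of the one remaining case that must return a TOE.
import java.util.List;
import java.util.UUID;

public class BacklogWriteFallback
{
    interface BacklogTarget
    {
        boolean isAlive();
        /** Returns false when the store attempt failed or timed out. */
        boolean store(UUID batchId, byte[] batch);
    }

    static class TimedOutException extends RuntimeException
    {
        TimedOutException(String msg) { super(msg); }
    }

    static class UnavailableException extends RuntimeException
    {
        UnavailableException(String msg) { super(msg); }
    }

    public void storeBacklog(List<BacklogTarget> targets, UUID batchId, byte[] batch)
    {
        boolean outcomeUnknown = false;
        for (BacklogTarget target : targets)
        {
            if (!target.isAlive())
                continue;
            if (target.store(batchId, batch))
                return; // stored durably; this client will never see a TOE
            outcomeUnknown = true; // an attempt was made; it may have landed
        }
        if (outcomeUnknown)
            // We may or may not have persisted the batch somewhere, so we can
            // report neither success nor clean failure: the unavoidable TOE.
            throw new TimedOutException("ambiguous backlog write, no live targets left");
        // No attempt was ever made, so nothing happened anywhere: a clean,
        // unambiguous UE rather than a TOE.
        throw new UnavailableException("no live backlog targets");
    }
}
{code}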

I understand the argument that (2) is the really important part, but again, we 
can deliver (1) without significantly more effort, so I think it's worth doing. 
 It's the difference between "you should still implement client-side retry 
after each op in case of failure" and "you can probably ignore this the way you 
do today with the chance of your Oracle installation failing before it gives 
you the answer."
                
> Atomic, eventually-consistent batches
> -------------------------------------
>
>                 Key: CASSANDRA-4285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4285
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>
> I discussed this in the context of triggers (CASSANDRA-1311) but it's useful 
> as a standalone feature as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira