[ https://issues.apache.org/jira/browse/CASSANDRA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407881#comment-13407881 ]

Sylvain Lebresne commented on CASSANDRA-4285:
---------------------------------------------

bq. Well, it's more complex than that.

I understand that but:
* with RF=1 we would still write to disk on only 1 node. So if some disk in the 
cluster has any problem, it's enough to have one other node go down (it doesn't 
have to be another hardware failure, it could be a simple OOM or anything 
really) to break the atomicity "guarantee". Granted, you have to be a bit 
unlucky, but the odds are far from unimaginable imo. And that's what guarantees 
are about: protecting you against being unlucky. I think RF=2 makes this orders 
of magnitude more secure, and if RF=2 had big drawbacks, then ok, why not 
consider RF=1 as the default, but I don't think that's the case, quite the 
contrary even.
* as said in my previous comment, it's not only about durability. It's a 
latency issue. If you do RF=1, then each time a node dies (or is upgraded, or 
whatnot) you *know* that some portion of batchlog writes on the cluster will 
suffer from timeouts (even if we retry on the coordinator, the latency will 
still suffer). That's actually the main reason why I think RF=2 is a much, much 
better default.
* I don't see much downside to RF=2 compared to RF=1. A little more network 
traffic and CPU usage maybe, but I think those are largely outweighed by the 
advantages (see the sketch after this list).
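
To make the point concrete, here's a rough sketch (plain Java with made-up 
names, not the actual batchlog code from the patch) of a coordinator picking 
two distinct live endpoints, preferably on different racks, to host a batchlog 
entry. With two copies, a single node being down or being bounced neither 
loses the entry nor forces the batchlog write to time out:

{code:java}
import java.net.InetAddress;
import java.util.*;

public final class BatchlogEndpointPicker
{
    /**
     * Hypothetical helper: choose up to two live endpoints, preferring
     * distinct racks, so that one node failure still leaves a usable copy.
     */
    public static List<InetAddress> pickTwo(Map<InetAddress, String> liveEndpointToRack)
    {
        List<InetAddress> chosen = new ArrayList<InetAddress>(2);
        Set<String> usedRacks = new HashSet<String>();

        // First pass: at most one endpoint per rack.
        for (Map.Entry<InetAddress, String> e : liveEndpointToRack.entrySet())
        {
            if (chosen.size() == 2)
                break;
            if (usedRacks.add(e.getValue()))
                chosen.add(e.getKey());
        }

        // Second pass: fall back to same-rack endpoints if fewer than two racks are live.
        for (InetAddress endpoint : liveEndpointToRack.keySet())
        {
            if (chosen.size() == 2)
                break;
            if (!chosen.contains(endpoint))
                chosen.add(endpoint);
        }

        // With RF=2, either chosen endpoint alone is enough to replay the batch.
        return chosen;
    }
}
{code}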

Overall I do think quite strongly that RF=1 is the wrong default (and having it 
configurable doesn't make it a better default). 

bq. "Failed writes leave the cluster in an unknown state" is the most frequent 
[legitimate] complaint users have about Cassandra, and one that affects 
evaluations vs master-oriented systems. We can try to educate about the 
difference between UE failure and TOE not-really-failure until we are blue in 
the face but we will continue to get hammered for it.

Let's be clear that I completely agree with that. But "Failed writes leave the 
cluster in an unknown state" is fixed by fixing atomicity. And I'm all for 
fixing batch atomicity; I even think that for CQL3 we should make batches 
atomic by default for all the reasons you mentioned (which wouldn't exclude 
having some escape hatch like "BATCH ... APPLY WITHOUT ATOMICITY GUARANTEE"). 
But whether we do coordinator-side retry is not directly related imho (and so 
at best should be considered separately).

To be precise, the DCL patch will add one more possibility for TOE compared to 
the current write path, and that's a TOE while writing into the DCL. First, I 
think that using RF=2 will largely mitigate the chance of getting that TOE in 
the first place, as said above. That being said, we could indeed retry another 
shard if we do still get a TOE, I suppose. The only thing that bothers me a bit 
is that I think it's useful for the timeout configured by the client to be an 
actual timeout on the server answer, even if only to say that we haven't 
achieved what was asked in the time granted (and again, I'm all for returning 
more information on what a TOE means exactly, i.e. CASSANDRA-4414, so that the 
client may be able to judge whether what we have been able to achieve during 
that time is enough that it doesn't need to retry). However, I suppose one 
option could be to try the DCL write with a smaller timeout than the 
client-supplied one, so that we can do a retry while respecting the client 
timeout.
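
As a rough illustration of that last option (a sketch with made-up names only, 
not the actual patch): give the first DCL write part of the client's budget and 
keep the remainder for one retry against another shard, so the answer still 
comes back within the client timeout:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class BatchlogWriteWithRetry
{
    /** Hypothetical write attempt: returns on success, throws TimeoutException otherwise. */
    public interface ShardWrite
    {
        void write(long timeoutMillis) throws TimeoutException;
    }

    public static void writeWithBudget(ShardWrite first, ShardWrite fallback, long clientTimeoutMillis)
    throws TimeoutException
    {
        long start = System.nanoTime();
        try
        {
            // Give the first attempt only part of the client budget so a retry can still fit.
            first.write(clientTimeoutMillis / 2);
        }
        catch (TimeoutException e)
        {
            long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            long remaining = clientTimeoutMillis - elapsed;
            if (remaining <= 0)
                throw e; // budget exhausted: surface the TOE to the client
            // Retry a different shard with whatever budget is left.
            fallback.write(remaining);
        }
    }
}
{code}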

bq. Finally, once the batchlog write succeeds, we shouldn't have to make the 
client retry for timeouts writing to the replicas either; we can do the retry 
server-side

My point is that retrying server-side in that case would be plain wrong. On the 
write path (that's not true for reads, but that is a different subject), a 
timeout when writing to the replicas means that the CL *cannot* be achieved at 
the current time (counters are another exception to that, but they are a whole 
different problem). So retrying (client- or server-side, for that matter) with 
the same CL is useless and bad. The only thing that can be improved compared to 
today is that we can tell the client that while the CL could not be achieved, 
we did persist the write on some replicas, which would remove the 
retry-with-smaller-CL-because-even-if-I-can't-get-my-CL-I-want-to-make-sure-the-write-is-at-least-persisted-on-some-replicas 
that most clients probably do today. And that is really useful, but it is also 
a totally separate issue from this ticket (namely CASSANDRA-4414) that doesn't 
only apply to batches, nor only to the atomic ones.
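
To illustrate the kind of information I mean (a sketch with made-up names, not 
an actual API): report how many replica acks the requested CL needed versus how 
many actually arrived before the timeout, so the client can tell "nothing 
persisted" from "persisted on some replicas but short of the CL" without 
blindly retrying at a smaller CL:

{code:java}
public final class WriteTimeoutInfo
{
    private final int acksRequired; // replicas needed to satisfy the requested CL
    private final int acksReceived; // replicas that acknowledged before the timeout

    public WriteTimeoutInfo(int acksRequired, int acksReceived)
    {
        this.acksRequired = acksRequired;
        this.acksReceived = acksReceived;
    }

    /** True when the write is durable on at least one replica even though the CL was not met. */
    public boolean persistedOnSomeReplica()
    {
        return acksReceived > 0 && acksReceived < acksRequired;
    }
}
{code}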

As a side note, I wouldn't be completely against discussing the possibility of 
doing some coordinator-side retry for reads, but that's a different issue :)
                
> Atomic, eventually-consistent batches
> -------------------------------------
>
>                 Key: CASSANDRA-4285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4285
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>
> I discussed this in the context of triggers (CASSANDRA-1311) but it's useful 
> as a standalone feature as well.


        
