[ 
https://issues.apache.org/jira/browse/CASSANDRA-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077518#comment-17077518
 ] 

Kevin Gallardo commented on CASSANDRA-15642:
--------------------------------------------

{quote}So what do you propose? Some questions to consider
{quote}
I propose waiting for {{blockFor}} (with a {{CountDownLatch}} for example in 
{{ReadCallback}}, as detailed on CASSANDRA-15543), until timeout, if it 
happens. And in case of a timeout, provide a cohesive message saying: "_These_ 
are the hosts that responded successfully, _these_ that failed, _these_ haven't 
responded before timeout." If there is no timeout, then once all required 
responses have come back, if there are failures, explain who failed and who 
succeeded. This provides better usability than the current incomplete error 
messages.
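
A minimal sketch of that idea, not the actual {{ReadCallback}} code: the class name, method names, and message format below are hypothetical, and it assumes that both successes and failures count down a latch initialised to {{blockFor}}.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical callback: records every outcome and only unblocks after
// blockFor outcomes (success or failure) have arrived, or on timeout.
final class CompleteOutcomeCallback {
    private final Set<String> contacted;
    private final Set<String> succeeded = ConcurrentHashMap.newKeySet();
    private final Map<String, String> failed = new ConcurrentHashMap<>();
    private final CountDownLatch latch;

    CompleteOutcomeCallback(Collection<String> contacted, int blockFor) {
        this.contacted = new TreeSet<>(contacted);
        this.latch = new CountDownLatch(blockFor);
    }

    // Called from the messaging threads; each host is counted at most once.
    void onResponse(String host) {
        if (succeeded.add(host)) latch.countDown();
    }

    void onFailure(String host, String reason) {
        if (failed.putIfAbsent(host, reason) == null) latch.countDown();
    }

    // Waits for blockFor outcomes or the timeout, then describes all hosts.
    String awaitAndDescribe(long timeout, TimeUnit unit) {
        boolean completed;
        try {
            completed = latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            completed = false;
        }
        Set<String> ok = new TreeSet<>(succeeded);
        Map<String, String> bad = new TreeMap<>(failed);
        Set<String> pending = new TreeSet<>(contacted);
        pending.removeAll(ok);
        pending.removeAll(bad.keySet());
        if (completed && bad.isEmpty()) return "OK: responded=" + ok;
        return (completed ? "Failure" : "Timeout")
                + ": responded=" + ok
                + ", failed=" + bad
                + ", no response yet=" + pending;
    }
}
```

With hosts x, y and z contacted and {{blockFor}} = 3, a failure from x and a success from z followed by a timeout would yield a message naming x as failed, z as responded, and y as not having responded yet.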
{quote}How long until you're sure you've received all the responses you might 
ever receive?
{quote}
The trap! Theoretically, you can't know, it's an asynchronous distributed 
system. I am well aware of things like the FLP impossibility. More seriously, 
what you can say, though, is "we haven't received anything from X 
within the given timeframe, this might indicate an issue". Now, this is my 
personal POV, but I think it's valuable to also provide that information for 
Ops to have a look into the nodes that haven't responded in time. Keeping in 
mind that this is a difference in behavior only in the case of: [some 
sub-requests fail AND some don't respond at all].

A good example of a usability improvement I can mention is the case I've 
seen on CASSANDRA-15543. The test author initially assumed that for the read 
with the schema disagreement they would get in the error message "hosts x and y 
don't have the correct schema, but host z does". But you can't expect that from 
the error atm; all you can be sure to get is "x or y don't have the required 
schema". Usability would be vastly improved if you were able to give the 
complete view of the problem. ISTM that in this scenario, this information can 
in most cases be returned without a timeout. That cannot be strongly 
guaranteed because, again, this is a distributed async system, but usability 
would be improved for a larger number of situations.
{quote}How would you balance the delayed responses with user requirements to 
take corrective action promptly in response to failures?
{quote}
I would argue that the user wouldn't be able to act optimally anyway with only 
partial information. In the case of "failures + timeouts", there would be 
more value in them knowing that there are also timeouts with some nodes, instead 
of just the failure.

Thanks for the discussion

> Inconsistent failure messages on distributed queries
> ----------------------------------------------------
>
>                 Key: CASSANDRA-15642
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15642
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination
>            Reporter: Kevin Gallardo
>            Priority: Normal
>
> As a follow up to some exploration I have done for CASSANDRA-15543, I 
> realized the following behavior in both {{ReadCallback}} and 
> {{AbstractWriteResponseHandler}}:
>  - await for responses
>  - when all required number of responses have come back: unblock the wait
>  - when a single failure happens: unblock the wait
>  - when unblocked, look to see if the counter of failures is > 0 and if so 
> return an error message based on the {{failures}} map that's been filled
> Error messages that can result from this behavior can be a ReadTimeout, a 
> ReadFailure, a WriteTimeout or a WriteFailure.
> In case of a Write/ReadFailure, the user will get back an error looking like 
> the following:
> "Failure: Received X responses, and Y failures"
> (if this behavior I describe is incorrect, please correct me)
> This causes a usability problem. Since the handler will fail and throw an 
> exception as soon as 1 failure happens, the error message that is returned to 
> the user may not be accurate.
> (note: I am not entirely sure of the behavior in case of timeouts for now)
> For example, with a request at CL = QUORUM = 3, a failed request may complete 
> first, then a successful one completes, and another fails. If the exception 
> is thrown fast enough, the error message could say 
>  "Failure: Received 0 response, and 1 failure at CL = 3"
> Which:
> 1. doesn't make a lot of sense because the CL doesn't match the number of 
> results in the message, so you end up thinking "what happened with the rest 
> of the required CL?"
> 2. the information is incorrect. We did receive a successful response, only 
> it came after the initial failure.
> From that logic, I think it is safe to assume that the information returned 
> in the error message cannot be trusted in case of a failure. The only 
> information users should extract out of it is that at least 1 node has failed.
> For a big improvement in usability, the {{ReadCallback}} and 
> {{AbstractWriteResponseHandler}} could instead wait for all responses to come 
> back before unblocking the wait, or let it time out. This way, users 
> will be able to have some trust in the information returned to them.
> Additionally, an error that happens first prevents a timeout from happening 
> because it fails immediately, and so it potentially hides problems with other 
> replicas. If we were to wait for all responses, we might get a timeout; in 
> that case we'd also be able to tell whether failures have happened *before* 
> that timeout, and have a more complete diagnostic than today, where you can't 
> detect both errors at the same time.
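
The race in the issue description's example can be reproduced in miniature. This is only a sequential simulation under stated assumptions, not Cassandra code: the class and method names are hypothetical, the arrival order is fixed by hand, and the message text follows the format quoted in the issue.

```java
import java.util.List;

// Hypothetical, single-threaded model of the two strategies.
final class FailFastSnapshotDemo {
    enum Outcome { SUCCESS, FAILURE }

    // Message format as quoted in the issue description.
    static String describe(int received, int failures, int blockFor) {
        return "Failure: Received " + received + " responses, and "
                + failures + " failures at CL = " + blockFor;
    }

    // Behaviour described in the issue: unblock on the first failure.
    static String failFast(List<Outcome> arrivals, int blockFor) {
        int received = 0;
        for (Outcome o : arrivals) {
            if (o == Outcome.FAILURE) return describe(received, 1, blockFor);
            received++;
            if (received == blockFor) return "OK";
        }
        return "Timeout";
    }

    // Proposed behaviour: keep counting until blockFor outcomes are in.
    static String waitForAll(List<Outcome> arrivals, int blockFor) {
        int received = 0;
        int failures = 0;
        for (Outcome o : arrivals) {
            if (o == Outcome.FAILURE) failures++; else received++;
            if (received + failures == blockFor) break;
        }
        // Simplification: a failure with outcomes still missing is
        // reported as a plain failure rather than failure plus timeout.
        if (failures > 0) return describe(received, failures, blockFor);
        return received >= blockFor ? "OK" : "Timeout";
    }

    public static void main(String[] args) {
        List<Outcome> arrivals =
                List.of(Outcome.FAILURE, Outcome.SUCCESS, Outcome.FAILURE);
        System.out.println(failFast(arrivals, 3));
        System.out.println(waitForAll(arrivals, 3));
    }
}
```

For the arrival order failure, success, failure, the fail-fast variant reports 0 responses and 1 failure, while the wait-for-all variant reports 1 response and 2 failures, which adds up to the required 3.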


