[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Thomas Tauber-Marshall (Code Review) Tue, 15 Dec 2020 13:19:49 -0800

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16849 )


Change subject: IMPALA-10336: Coordinator return incorrect error to client
......................................................................


Patch Set 2:

So I think the underlying issue here is that the ExecCompleteCb() can be called 
with an ABORTED status even though the rpc was actually sent successfully. The 
KRPC docs are a little vague about this, but see 
https://github.com/apache/impala/blob/master/be/src/kudu/rpc/outbound_call.cc#L246

This means we can get an "OK" status report for a backend that had its status 
previously set as ABORTED. Since that backend will be 'done' (because it has an 
error status), we'll try to set the overall query status to ABORTED.

Since the cancellation that happened that caused the rpc to get ABORTED 
happened because of another backend reporting an error, most of the time the 
error from that backend will end up set as the overall query status, but 
there's a race where the thread processing the report for the ABORTED backend 
could win and end up setting the overall query status to ABORTED first.

I don't think this patch fixes that - its possible that the backend's status 
will be set to ABORTED prior to the call to UpdateBackendExecStatus() for it, 
in which case I think that ApplyExecStatusReport() will still just return 
ABORTED as the error in the status out parameter that you added.

I think you could repro the situation if you add some artificial delays in two 
places: 1) have the backend that does not encounter an error sleep indefinitely 
before responding to the Exec() rpc but after starting execution (so that KRPC 
will say that it was ABORTED but it will still try to send status reports) and 
2) have Coordinator::UpdateBackendExecStatus sleep indefinitely when processing 
the status report from the backend that reports the "real" error after it 
causes the other Exec rpcs to be cancelled (i.e. after the call to 
exec_rpcs_status_barrier_.NotifyRemaining)

I think instead the solution is to ensure that if a backend has Cancel() called 
on it, that its status will end up as CANCELLED, and not ABORTED. This fixes 
the problem in the JIRA because Coordinator::UpdateExecState already has logic 
where it will prefer to set the overall query status to a "real" status over 
CANCELLED, see 
https://github.com/apache/impala/blob/master/be/src/runtime/coordinator.cc#L704,
 and its a nice invariant that a cancelled backend will have a status of 
Cancelled()

Of course, let me know if anything above is wrong.


--
To view, visit http://gerrit.cloudera.org:8080/16849
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318
Gerrit-Change-Number: 16849
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Tue, 15 Dec 2020 21:19:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10336: Coordinator return incorrect error to client

Reply via email to