Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/16849 )
Change subject: IMPALA-10336: Coordinator return incorrect error to client ...................................................................... Patch Set 2: So I think the underlying issue here is that the ExecCompleteCb() can be called with an ABORTED status even though the rpc was actually sent successfully. The KRPC docs are a little vague about this, but see https://github.com/apache/impala/blob/master/be/src/kudu/rpc/outbound_call.cc#L246 This means we can get an "OK" status report for a backend that had its status previously set as ABORTED. Since that backend will be 'done' (because it has an error status), we'll try to set the overall query status to ABORTED. Since the cancellation that happened that caused the rpc to get ABORTED happened because of another backend reporting an error, most of the time the error from that backend will end up set as the overall query status, but there's a race where the thread processing the report for the ABORTED backend could win and end up setting the overall query status to ABORTED first. I don't think this patch fixes that - its possible that the backend's status will be set to ABORTED prior to the call to UpdateBackendExecStatus() for it, in which case I think that ApplyExecStatusReport() will still just return ABORTED as the error in the status out parameter that you added. I think you could repro the situation if you add some artificial delays in two places: 1) have the backend that does not encounter an error sleep indefinitely before responding to the Exec() rpc but after starting execution (so that KRPC will say that it was ABORTED but it will still try to send status reports) and 2) have Coordinator::UpdateBackendExecStatus sleep indefinitely when processing the status report from the backend that reports the "real" error after it causes the other Exec rpcs to be cancelled (i.e. after the call to exec_rpcs_status_barrier_.NotifyRemaining) I think instead the solution is to ensure that if a backend has Cancel() called on it, that its status will end up as CANCELLED, and not ABORTED. This fixes the problem in the JIRA because Coordinator::UpdateExecState already has logic where it will prefer to set the overall query status to a "real" status over CANCELLED, see https://github.com/apache/impala/blob/master/be/src/runtime/coordinator.cc#L704, and its a nice invariant that a cancelled backend will have a status of Cancelled() Of course, let me know if anything above is wrong. -- To view, visit http://gerrit.cloudera.org:8080/16849 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I75f252e43006c6ff6980800e3254672de396b318 Gerrit-Change-Number: 16849 Gerrit-PatchSet: 2 Gerrit-Owner: Wenzhe Zhou <wz...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Comment-Date: Tue, 15 Dec 2020 21:19:37 +0000 Gerrit-HasComments: No