[ https://issues.apache.org/jira/browse/IMPALA-11263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526742#comment-17526742 ]
Wenzhe Zhou edited comment on IMPALA-11263 at 4/23/22 7:38 AM:
---------------------------------------------------------------

In Coordinator::BackendState::Cancel(), if exec_rpc_sent_ is true and exec_done_ is false (meaning Coordinator::BackendState::ExecAsync() has been called but the callback Coordinator::BackendState::ExecCompleteCb() has not), we call RpcController::Cancel() to cancel the Exec() RPC and then call WaitOnExecLocked() to wait for Coordinator::BackendState::ExecCompleteCb() to be called.

From the log messages above, Coordinator::BackendState::Cancel() for the 4th backend hung after calling WaitOnExecLocked(). That means the callback function was never called.

Coordinator::BackendState::ExecCompleteCb() should be called whenever the RPC finishes successfully, finishes with an error, is cancelled, or times out.

RpcController::Cancel() schedules a cancellation task on the reactor thread pool. A reactor thread executes the task in ReactorThread::CancelOutboundCall(), which calls Connection::CancelOutboundCall() and OutboundCall::Cancel(). Connection::CancelOutboundCall() resets car->call to null so that Connection::HandleOutboundCallTimeout() skips the call to OutboundCall::SetTimedOut(). OutboundCall::Cancel() does not call OutboundCall::SetCancelled() if the call's state is SENDING; in that case OutboundCall::SetCancelled() is not called until OutboundCall::SetSent() is called.

So if the RPC is cancelled while OutboundCall::state_ is SENDING, OutboundCall::SetTimedOut() will not be called when the outbound call's timeout is handled in Connection::HandleOutboundCallTimeout(), and OutboundCall::SetCancelled() will not be called if the transfer-finished notification (CallTransferCallbacks::NotifyTransferFinished()) is not received after the RPC call is sent on the wire. Coordinator::BackendState::ExecCompleteCb() will not be called if neither OutboundCall::SetCancelled() nor OutboundCall::SetTimedOut() is called.
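
The cancellation sequence described above can be sketched with a small toy model. This is only an illustration under stated assumptions: ToyOutboundCall, ToyConnection, CallState and CountCallbacks are hypothetical names invented here, and the logic is a simplification of the behaviour described in this comment, not Kudu's actual OutboundCall or Connection code.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Toy model only: names and structure are simplified assumptions,
// not Kudu's actual OutboundCall or Connection classes.
enum class CallState { kOnQueue, kSending, kSent, kCancelled, kTimedOut };

class ToyOutboundCall {
 public:
  explicit ToyOutboundCall(std::function<void()> done_cb)
      : done_cb_(std::move(done_cb)) {}

  void SetSending() { state_ = CallState::kSending; }

  // Reached only when the transfer-finished notification arrives.
  void SetSent() {
    state_ = CallState::kSent;
    if (cancel_requested_) Finish(CallState::kCancelled);
  }

  // While SENDING, the cancellation is only recorded, not delivered.
  void Cancel() {
    cancel_requested_ = true;
    if (state_ != CallState::kSending) Finish(CallState::kCancelled);
  }

  void SetTimedOut() { Finish(CallState::kTimedOut); }

  CallState state() const { return state_; }

 private:
  // Moves to a terminal state and runs the completion callback once.
  void Finish(CallState terminal) {
    if (finished_) return;
    finished_ = true;
    state_ = terminal;
    done_cb_();
  }

  std::function<void()> done_cb_;
  CallState state_ = CallState::kOnQueue;
  bool cancel_requested_ = false;
  bool finished_ = false;
};

struct ToyConnection {
  ToyOutboundCall* awaiting = nullptr;  // stands in for car->call

  void CancelOutboundCall() {
    ToyOutboundCall* call = awaiting;
    awaiting = nullptr;  // the timeout handler will now skip this call
    if (call != nullptr) call->Cancel();
  }

  void HandleOutboundCallTimeout() {
    if (awaiting != nullptr) awaiting->SetTimedOut();
  }
};

// Returns how many times the completion callback ran.
int CountCallbacks(bool transfer_finishes) {
  int callbacks = 0;
  ToyOutboundCall call([&callbacks] { ++callbacks; });
  ToyConnection conn;
  conn.awaiting = &call;
  call.SetSending();
  conn.CancelOutboundCall();         // cancel arrives while SENDING
  conn.HandleOutboundCallTimeout();  // skipped: awaiting is null
  if (transfer_finishes) call.SetSent();
  return callbacks;
}
```

In this model, CountCallbacks(true) runs the callback once (the deferred cancellation is delivered from SetSent()), while CountCallbacks(false) never runs it, which corresponds to the missing ExecCompleteCb() call.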
Consequently, if CallTransferCallbacks::NotifyTransferFinished() is not called after the RPC call is sent on the wire, Coordinator::BackendState::ExecCompleteCb() will not be called, which leads Coordinator::BackendState::WaitOnExecLocked() to wait indefinitely.

Connection::ProcessOutboundTransfers() calls OutboundCall::SetSending() to set OutboundCall::state_ to SENDING when it starts the transfer. It then calls OutboundTransfer::SendBuffer() to send the data through the socket. OutboundTransfer::SendBuffer() calls socket->Writev() to send the data. If socket->Writev() returns an error, SendBuffer() returns the error without calling CallTransferCallbacks::NotifyTransferFinished(), so OutboundCall::SetSent() is never called. This means that if the socket write fails, OutboundCall::state_ stays at SENDING and OutboundCall::SetCancelled() is never called. This is the case in which Coordinator::BackendState::WaitOnExecLocked() waits indefinitely.

> Coordinator hang when cancelling a query
> ----------------------------------------
>
>                 Key: IMPALA-11263
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11263
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>
> In a rare case, the callback function Coordinator::BackendState::ExecCompleteCb()
> was, for some reason, not called for the corresponding ExecQueryFInstances RPC.
> This caused the coordinator to wait indefinitely when cancelling the query.
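
As an illustration of the write-failure path described in the comment above, here is a minimal single-threaded sketch. ToyExecState, ToySendBuffer and WaiterReleased are hypothetical names made up for this example; the `write` callable stands in for the socket write, and the `exec_done` flag stands in for the condition that a waiter such as WaitOnExecLocked() would block on. It is not the real Kudu or Impala code.

```cpp
#include <cassert>
#include <functional>

// Hypothetical, simplified stand-ins; not Kudu's or Impala's real classes.
struct ToyExecState {
  bool sending = false;           // models OutboundCall state_ == SENDING
  bool cancel_requested = false;  // Cancel() arrived while SENDING
  bool exec_done = false;         // set by the completion callback

  // Models the transfer-finished notification that leads to SetSent().
  void NotifyTransferFinished() {
    sending = false;
    if (cancel_requested) exec_done = true;  // deferred cancel runs the callback
  }
};

// Models SendBuffer(): `write` stands in for the socket write and returns
// false on error. On error we return early and skip the notification.
bool ToySendBuffer(ToyExecState& st, const std::function<bool()>& write) {
  if (!write()) return false;
  st.NotifyTransferFinished();
  return true;
}

// Returns true if a waiter blocked on exec_done would be released.
bool WaiterReleased(bool write_ok) {
  ToyExecState st;
  st.sending = true;           // transfer started: state set to SENDING
  st.cancel_requested = true;  // Cancel() while SENDING is deferred
  ToySendBuffer(st, [write_ok] { return write_ok; });
  return st.exec_done;
}
```

With a successful write, WaiterReleased(true) is true because the deferred cancellation fires once the transfer finishes. With a failed write, WaiterReleased(false) is false: the state stays at SENDING and nothing ever sets exec_done, which is the indefinite wait reported in this issue.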