[ 
https://issues.apache.org/jira/browse/IMPALA-11263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526742#comment-17526742
 ] 

Wenzhe Zhou edited comment on IMPALA-11263 at 4/23/22 7:38 AM:
---------------------------------------------------------------

In Coordinator::BackendState::Cancel(), if exec_rpc_sent_ is true and 
exec_done_ is false (meaning Coordinator::BackendState::ExecAsync() has been 
called, but the callback Coordinator::BackendState::ExecCompleteCb() has not), 
we call RpcController::Cancel() to cancel the Exec() RPC and then call 
WaitOnExecLocked() to wait for Coordinator::BackendState::ExecCompleteCb() to 
be called. 
From the above log message, Coordinator::BackendState::Cancel() for the 4th 
backend hung after calling WaitOnExecLocked(), which means the callback was 
never called.
Coordinator::BackendState::ExecCompleteCb() should be called whenever the RPC 
finishes successfully, finishes with an error, is cancelled, or times out.
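
The wait described above can be sketched with a small stand-alone model. This 
is not the Impala code: BackendStateModel and the max_wait parameter are 
hypothetical, and only the exec_rpc_sent_ / exec_done_ flags and the method 
names are taken from the comment. The real WaitOnExecLocked() is described as 
waiting without a bound; the bounded wait here exists only so the stuck case 
can be observed instead of hanging.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical, simplified stand-in for Coordinator::BackendState.
class BackendStateModel {
 public:
  // Models ExecAsync(): the Exec() RPC has been issued.
  void ExecAsync() {
    std::lock_guard<std::mutex> l(lock_);
    exec_rpc_sent_ = true;
  }

  // Models ExecCompleteCb(): the RPC layer reports that the call ended
  // (success, error, cancellation or timeout).
  void ExecCompleteCb() {
    {
      std::lock_guard<std::mutex> l(lock_);
      assert(exec_rpc_sent_);  // a completion implies the RPC was sent
      exec_done_ = true;
    }
    exec_done_cv_.notify_all();
  }

  // Models Cancel(). Returns true if there was nothing to wait for or the
  // callback arrived within max_wait; false if the wait is still blocked.
  // max_wait is an assumption of this sketch, not part of the real code.
  bool Cancel(std::chrono::milliseconds max_wait) {
    std::unique_lock<std::mutex> l(lock_);
    if (!exec_rpc_sent_ || exec_done_) return true;
    // The real code would call RpcController::Cancel() at this point.
    return exec_done_cv_.wait_for(l, max_wait, [this] { return exec_done_; });
  }

 private:
  std::mutex lock_;
  std::condition_variable exec_done_cv_;
  bool exec_rpc_sent_ = false;
  bool exec_done_ = false;
};
```

In this model, Cancel() returns false exactly in the situation reported here: 
the RPC was sent, but ExecCompleteCb() never ran.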

RpcController::Cancel() schedules a cancellation task on the reactor thread 
pool. The reactor thread executes the task in 
ReactorThread::CancelOutboundCall(), which calls 
Connection::CancelOutboundCall() and OutboundCall::Cancel(). 
Connection::CancelOutboundCall() resets car->call to null so that 
Connection::HandleOutboundCallTimeout() skips the call to 
OutboundCall::SetTimedOut(). OutboundCall::Cancel() does not call 
OutboundCall::SetCancelled() if the call's state is SENDING; in that case 
OutboundCall::SetCancelled() is not called until OutboundCall::SetSent() is 
called. So if the RPC is cancelled while OutboundCall::state_ is SENDING, 
OutboundCall::SetTimedOut() will not be called when the outbound call's 
timeout is handled in Connection::HandleOutboundCallTimeout(), and 
OutboundCall::SetCancelled() will not be called unless the transfer-finished 
notification (CallTransferCallbacks::NotifyTransferFinished()) is received 
after the RPC call is sent on the wire.
Coordinator::BackendState::ExecCompleteCb() will not be called if neither 
OutboundCall::SetCancelled() nor OutboundCall::SetTimedOut() is called.
That means that if CallTransferCallbacks::NotifyTransferFinished() is not 
called after the RPC call is sent on the wire, 
Coordinator::BackendState::ExecCompleteCb() will not be called, which causes 
Coordinator::BackendState::WaitOnExecLocked() to wait indefinitely. 
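
The deferred cancellation can be illustrated with a toy state machine. This is 
a sketch under stated assumptions, not the Kudu RPC code: OutboundCallModel, 
CallState and HandleTimeout() are hypothetical names, the state set is reduced 
to what the comment mentions, and the effect of 
Connection::CancelOutboundCall() (resetting car->call) is folded into a single 
flag inside Cancel().

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Reduced, hypothetical state set for illustration.
enum class CallState { READY, SENDING, SENT, TIMED_OUT, CANCELLED };

// Hypothetical model of an outbound call and its completion callback.
class OutboundCallModel {
 public:
  explicit OutboundCallModel(std::function<void()> cb)
      : callback_(std::move(cb)) {}

  void SetSending() { state_ = CallState::SENDING; }

  // Reached only via the transfer-finished notification.
  void SetSent() {
    assert(state_ == CallState::SENDING);
    state_ = CallState::SENT;
    if (cancellation_requested_) Finish(CallState::CANCELLED);
  }

  // Models the cancellation task run by the reactor thread.
  void Cancel() {
    cancellation_requested_ = true;
    timeout_detached_ = true;  // models car->call being reset to null
    // While SENDING, cancellation is deferred until SetSent().
    if (state_ != CallState::SENDING) Finish(CallState::CANCELLED);
  }

  // Models the timeout handler: it does nothing once the call is detached.
  void HandleTimeout() {
    if (timeout_detached_) return;
    Finish(CallState::TIMED_OUT);
  }

  CallState state() const { return state_; }

 private:
  // Moves to a terminal state and fires the callback at most once.
  void Finish(CallState terminal) {
    if (finished_) return;
    finished_ = true;
    state_ = terminal;
    callback_();
  }

  std::function<void()> callback_;
  CallState state_ = CallState::READY;
  bool cancellation_requested_ = false;
  bool timeout_detached_ = false;
  bool finished_ = false;
};
```

In this model, calling SetSending(), Cancel() and HandleTimeout() in that order 
leaves the call in SENDING with the callback never fired; only a later 
SetSent() completes it as CANCELLED.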

Connection::ProcessOutboundTransfers() calls OutboundCall::SetSending() to set 
OutboundCall::state_ to SENDING when it starts the transfer. It then calls 
OutboundTransfer::SendBuffer() to send the data through the socket. 
OutboundTransfer::SendBuffer() calls socket->Writev() to send the data. If 
socket->Writev() returns an error, the function returns the error without 
calling CallTransferCallbacks::NotifyTransferFinished(), so 
OutboundCall::SetSent() is not called.
This means that if the socket write fails, OutboundCall::state_ stays in the 
SENDING state and OutboundCall::SetCancelled() is not called. 
This is the case in which Coordinator::BackendState::WaitOnExecLocked() waits 
indefinitely.
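
The send path described above can be sketched as follows. It mirrors only the 
sequence given in this comment, not the actual Kudu implementation: 
OutboundTransferModel, SendState and ProcessOutboundTransfer() are 
hypothetical, and the writev argument is a stand-in for socket->Writev() that 
returns false on error.

```cpp
#include <cassert>
#include <functional>

// Reduced, hypothetical state set for the send path only.
enum class SendState { READY, SENDING, SENT };

// Hypothetical model of a call being transferred.
struct OutboundTransferModel {
  SendState state = SendState::READY;

  // Models NotifyTransferFinished() leading to SetSent().
  void NotifyTransferFinished() {
    assert(state == SendState::SENDING);
    state = SendState::SENT;
  }
};

// Models ProcessOutboundTransfers() followed by SendBuffer().
// Returns false if the (stand-in) socket write failed.
bool ProcessOutboundTransfer(OutboundTransferModel& t,
                             const std::function<bool()>& writev) {
  t.state = SendState::SENDING;  // models SetSending()
  if (!writev()) {
    // Error path as described: return without any notification.
    return false;
  }
  t.NotifyTransferFinished();
  return true;
}
```

With a failing writev, the model returns false and the state remains SENDING, 
which is the precondition for the deferred cancellation never completing.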



> Coordinator hang when cancelling a query
> ----------------------------------------
>
>                 Key: IMPALA-11263
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11263
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>
> In a rare case, the callback function 
> Coordinator::BackendState::ExecCompleteCb() was not called for the 
> corresponding ExecQueryFInstances RPC. This caused the coordinator to wait 
> indefinitely when cancelling the query.


