[ https://issues.apache.org/jira/browse/IMPALA-11263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526111#comment-17526111 ]

Wenzhe Zhou commented on IMPALA-11263:
--------------------------------------

Log messages for the failed query dc48f399b63dd87c:43f31fd900000000, which was 
the retry of another query, 6d49b7f0544e71cc:37d2486300000000:
{code:java}
2022-03-24-13-50.log:    I0324 13:52:43.108821  9754 query-driver.cc:253] 
Retrying query 6d49b7f0544e71cc:37d2486300000000 with new query id 
dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-50.log:    I0324 13:52:43.108845  9754 impala-server.cc:1352] 
Registered query query_id=dc48f399b63dd87c:43f31fd900000000 
session_id=274cff1ef21d8454:80abbc8346fdc7b1
2022-03-24-13-50.log:    I0324 13:52:43.109871  9754 impala-server.cc:1392] 
Query dc48f399b63dd87c:43f31fd900000000 has idle timeout of 10m
2022-03-24-13-50.log:    I0324 13:52:43.109925  9754 query-driver.cc:360] 
Retried query 6d49b7f0544e71cc:37d2486300000000 with new query id 
dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-50.log:    I0324 13:52:43.138975  9756 
admission-controller.cc:1879] Trying to admit 
id=dc48f399b63dd87c:43f31fd900000000 in pool_name=root.default 
executor_group_name=root.default-group-000 per_host_mem_estimate=85.28 GB 
dedicated_coord_mem_estimate=376.55 MB max_requests=-1 max_queued=200 
max_mem=48828.12 GB
{code}

The query was admitted:
{code:java}
2022-03-24-13-55.log:    I0324 13:58:30.230749  9756 
admission-controller.cc:1378] Admitted queued query 
id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:30.243191  9756 coordinator.cc:150] Exec() 
query_id=dc48f399b63dd87c:43f31fd900000000 stmt=with data_set as (    
{code}

The query started executing on 5 backends:
{code:java}
2022-03-24-13-55.log:    I0324 13:58:30.252866  9756 coordinator.cc:475] 
starting execution on 5 backends for query_id=dc48f399b63dd87c:43f31fd900000000
{code}

The ExecQueryFInstances RPC failed for one backend due to a network issue:
{code:java}
2022-03-24-13-55.log:    I0324 13:58:30.253540   147 control-service.cc:148] 
dc48f399b63dd87c:43f31fd900000000] ExecQueryFInstances(): 
query_id=dc48f399b63dd87c:43f31fd900000000 coord=coordinator-0:27000 
#instances=1
2022-03-24-13-55.log:    I0324 13:58:30.254931 10438 query-state.cc:948] 
dc48f399b63dd87c:43f31fd900000000] Executing instance. 
instance_id=dc48f399b63dd87c:43f31fd900000000 fragment_idx=0 
per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=5
W0324 13:58:30.256196 10435 executor-group.cc:164] Executor group 
root.default-group-002 is 
unhealthy: 3 out of 4 are available.
2022-03-24-13-55.log:    I0324 13:58:30.274003   131 kudu-status-util.h:55] 
Exec() rpc failed: Network error: Client connection negotiation failed: client 
connection to 10.143.1.130:27010: connect: Connection refused (error 111)
E0324 13:58:30.274055   131 coordinator-backend-state.cc:190] ExecQueryFInstances rpc 
query_id=dc48f399b63dd87c:43f31fd900000000 failed: Exec() rpc failed: Network 
error: Client connection negotiation failed: client connection to 
10.143.1.130:27010: connect: Connection refused (error 111)
{code}

Since one of the backends failed to start up, the coordinator canceled 
execution on the other backends by calling Coordinator::CancelBackends().

Coordinator::CancelBackends() should call Coordinator::BackendState::Cancel() 
for all 5 backends and then write the log message "CancelBackends() 
query_id=dc48f399b63dd87c:43f31fd900000000, tried to cancel 5 backends". But I 
could not find this message in the log file, and could only find 4 log messages 
written by Coordinator::BackendState::Cancel(). It looks like the thread hung 
in Coordinator::BackendState::Cancel() on the 4th backend.


{code:java}
2022-03-24-13-55.log:    I0324 13:58:30.281944  9756 
coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 
target backend=10.143.10.147:27000: Sending CancelQueryFInstances rpc
2022-03-24-13-55.log:    I0324 13:58:30.282232   147 control-service.cc:219] 
CancelQueryFInstances(): query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:30.282265   147 query-exec-mgr.cc:126] 
QueryState: query_id=dc48f399b63dd87c:43f31fd900000000 refcnt=4
2022-03-24-13-55.log:    I0324 13:58:30.282274   147 query-state.cc:974] 
Cancel: query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:30.282284   147 
krpc-data-stream-mgr.cc:339] cancelling active streams for 
query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:30.282481  9756 
coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 
target backend=10.143.0.249:27010: Sending CancelQueryFInstances rpc
2022-03-24-13-55.log:    I0324 13:58:30.296829   534 
admission-controller.cc:1887] Stats: agg_num_running=6, agg_num_queued=4, 
agg_mem_reserved=433.15 GB,  local_host(local_mem_admitted=481.94 GB, 
num_admitted_running=6, num_queued=4, backend_mem_reserved=2.14 GB, 
topN_query_stats: queries=[a8450959eb292a8e:df40af3200000000, 
904fd49f800e6964:5aa6fc5200000000, 1e42ce4cc6ddb020:46960fab00000000, 
dc48f399b63dd87c:43f31fd900000000, 8f438f867f55c89b:efdb7c1600000000], 
total_mem_consumed=49.40 MB, fraction_of_pool_total_mem=1; pool_level_stats: 
num_running=21, min=0, max=25.27 MB, pool_total_mem=49.40 MB, 
average_per_query=2.35 MB)
2022-03-24-13-55.log:    I0324 13:58:30.297036   534 
admission-controller.cc:1887] Stats: agg_num_running=6, agg_num_queued=4, 
agg_mem_reserved=433.15 GB,  local_host(local_mem_admitted=481.94 GB, 
num_admitted_running=6, num_queued=4, backend_mem_reserved=2.14 GB, 
topN_query_stats: queries=[a8450959eb292a8e:df40af3200000000, 
904fd49f800e6964:5aa6fc5200000000, 1e42ce4cc6ddb020:46960fab00000000, 
dc48f399b63dd87c:43f31fd900000000, 8f438f867f55c89b:efdb7c1600000000], 
total_mem_consumed=49.40 MB, fraction_of_pool_total_mem=1; pool_level_stats: 
num_running=21, min=0, max=25.27 MB, pool_total_mem=49.40 MB, 
average_per_query=2.35 MB)
2022-03-24-13-55.log:    I0324 13:58:30.297185   534 
admission-controller.cc:1887] Stats: agg_num_running=6, agg_num_queued=4, 
agg_mem_reserved=433.15 GB,  local_host(local_mem_admitted=481.94 GB, 
num_admitted_running=6, num_queued=4, backend_mem_reserved=2.14 GB, 
topN_query_stats: queries=[a8450959eb292a8e:df40af3200000000, 
904fd49f800e6964:5aa6fc5200000000, 1e42ce4cc6ddb020:46960fab00000000, 
dc48f399b63dd87c:43f31fd900000000, 8f438f867f55c89b:efdb7c1600000000], 
total_mem_consumed=49.40 MB, fraction_of_pool_total_mem=1; pool_level_stats: 
num_running=21, min=0, max=25.27 MB, pool_total_mem=49.40 MB, 
average_per_query=2.35 MB)
2022-03-24-13-55.log:    I0324 13:58:30.303017 10438 
krpc-data-stream-mgr.cc:308] dc48f399b63dd87c:43f31fd900000000] 
DeregisterRecvr(): fragment_instance_id=dc48f399b63dd87c:43f31fd900000000, 
node=74
2022-03-24-13-55.log:    I0324 13:58:30.303082 10438 query-state.cc:957] 
dc48f399b63dd87c:43f31fd900000000] Instance completed. 
instance_id=dc48f399b63dd87c:43f31fd900000000 #in-flight=4 status=CANCELLED: 
Cancelled
2022-03-24-13-55.log:    I0324 13:58:30.303110 10437 query-state.cc:468] 
dc48f399b63dd87c:43f31fd900000000] UpdateBackendExecState(): last report for 
dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:30.304203 10437 query-state.cc:738] 
dc48f399b63dd87c:43f31fd900000000] Cancelling fragment instances as directed by 
the coordinator. Returned status: Cancelled
2022-03-24-13-55.log:    I0324 13:58:30.304234 10437 query-state.cc:974] 
dc48f399b63dd87c:43f31fd900000000] Cancel: 
query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:58:32.535372  9756 
coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 
target backend=10.143.1.130:27010: Not cancelling because the backend is 
already done: ExecQueryFInstances rpc 
query_id=dc48f399b63dd87c:43f31fd900000000 failed: Exec() rpc failed: Network 
error: Client connection negotiation failed: client connection to 
10.143.1.130:27010: connect: Connection refused (error 111)
2022-03-24-13-55.log:    I0324 13:58:32.535439  9756 
coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 
target backend=10.143.1.137:27010: Attempting to cancel Exec() rpc
{code}
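
For reference, the expected control flow is roughly the following. This is a 
simplified sketch of Coordinator::CancelBackends() based on the behavior 
described above, not the verbatim coordinator.cc implementation; the missing 
summary line in the log is consistent with Cancel() never returning for the 
4th backend.

{code:cpp}
// Simplified sketch, not the actual Impala source: the coordinator
// cancels every backend in turn, then logs a summary line.
void Coordinator::CancelBackends() {
  int num_cancelled = 0;
  for (BackendState* backend_state : backend_states_) {
    // Cancel() may block in WaitOnExecLocked(), waiting for the
    // in-flight ExecQueryFInstances RPC to complete. If that wait never
    // returns, the loop stalls on this backend.
    if (backend_state->Cancel()) ++num_cancelled;
  }
  // Never reached if Cancel() hangs, so the "tried to cancel 5
  // backends" message never appears in the log.
  VLOG_QUERY << "CancelBackends() query_id=" << PrintId(query_id())
             << ", tried to cancel " << num_cancelled << " backends";
}
{code}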

The Impala server then tried to cancel query dc48f399b63dd87c:43f31fd900000000 
after being notified of failed backends:
{code:java}
2022-03-24-13-55.log:    I0324 13:58:39.793668   554 impala-server.cc:2274] 
Backends failed for query dc48f399b63dd87c:43f31fd900000000, adding to queue to 
check for cancellation: impala-executor-002-2:27010, impala-executor-002-3:27010
{code}

Since dc48f399b63dd87c:43f31fd900000000 was itself a retried query, the retry 
was skipped:

{code:java}
2022-03-24-13-55.log:    I0324 13:58:39.793761   538 query-driver.cc:150] 
Skipping retry of query_id=dc48f399b63dd87c:43f31fd900000000 because it has 
already been retried
{code}
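
The guard behind this message is presumably a simple check in the query 
driver; a hypothetical sketch (the method and member names here are 
assumptions, not the actual query-driver.cc code):

{code:cpp}
// Hypothetical sketch of the retry guard (names are assumptions).
void QueryDriver::TryQueryRetry(ClientRequestState* request_state) {
  // A query that is itself a retry is never retried again, so each
  // original query gets at most one retry attempt.
  if (request_state->IsRetriedQuery()) {
    LOG(INFO) << "Skipping retry of query_id="
              << PrintId(request_state->query_id())
              << " because it has already been retried";
    return;
  }
  // ... otherwise register a new query id and launch the retry ...
}
{code}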

But the cancellation failed due to an invalid query handle:

{code:java}
2022-03-24-13-55.log:    I0324 13:58:39.793794   538 impala-server.cc:1909] 
CancelFromThreadPool(): cancelling query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log:    I0324 13:59:11.101034   554 impala-server.cc:2274] 
Backends failed for query dc48f399b63dd87c:43f31fd900000000, adding to queue to 
check for cancellation: impala-executor-002-1:27010
2022-03-24-13-55.log:    I0324 13:59:11.101111   539 impala-server.cc:1599] 
Invalid or unknown query handle: dc48f399b63dd87c:43f31fd900000000.
2022-03-24-13-55.log:    I0324 13:59:11.101135   539 impala-server.cc:1855] 
CancelFromThreadPool(): query dc48f399b63dd87c:43f31fd900000000 already 
unregistered.
2022-03-24-14-00.log:    I0324 14:04:18.176677 11006 impala-server.cc:1436] 
UnregisterQuery(): query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-14-00.log:    I0324 14:04:18.176694 11006 impala-server.cc:1586] 
Invalid or unknown query handle: dc48f399b63dd87c:43f31fd900000000.
{code}

For this query failure, the query-retry thread hung in 
Coordinator::BackendState::Cancel() on the 4th backend when the coordinator 
tried to cancel the retried query. The last message written by the thread is 
"Attempting to cancel Exec() rpc". This indicates that WaitOnExecLocked() 
caused the thread to wait indefinitely. We can add a timeout to 
WaitOnExecLocked() to avoid this hang.
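
As a minimal standalone illustration of that direction (plain C++, not Impala 
code; the condition-variable shape and the 60-second timeout are assumptions):

{code:cpp}
#include <chrono>
#include <condition_variable>
#include <mutex>

// A bounded wait in place of an unbounded one, so a lost RPC completion
// callback cannot block the cancelling thread forever.
class ExecState {
 public:
  // Called by the Exec() completion callback.
  void MarkExecDone() {
    std::lock_guard<std::mutex> l(lock_);
    exec_done_ = true;
    exec_done_cv_.notify_all();
  }

  // Returns true if Exec() completed, false if we gave up on timeout.
  bool WaitOnExec(std::chrono::seconds timeout = std::chrono::seconds(60)) {
    std::unique_lock<std::mutex> l(lock_);
    // The predicate guards against spurious wakeups; wait_for() returns
    // false once the timeout expires with exec_done_ still unset.
    return exec_done_cv_.wait_for(l, timeout, [this] { return exec_done_; });
  }

 private:
  std::mutex lock_;
  std::condition_variable exec_done_cv_;
  bool exec_done_ = false;
};
{code}

On timeout the caller could log the stuck backend and treat it as done, 
letting CancelBackends() finish instead of blocking the query-retry thread.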




> Coordinator hang when cancelling a query
> ----------------------------------------
>
>                 Key: IMPALA-11263
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11263
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>
> In a rare case, the callback function Coordinator::BackendState::ExecCompleteCb() 
> was somehow not called for the corresponding ExecQueryFInstances RPC. This 
> caused the coordinator to wait indefinitely when cancelling the query.


