[ https://issues.apache.org/jira/browse/IMPALA-11263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526111#comment-17526111 ]
Wenzhe Zhou commented on IMPALA-11263:
--------------------------------------

Log messages for the failed query dc48f399b63dd87c:43f31fd900000000, which was the retried query for another query 6d49b7f0544e71cc:37d2486300000000.
{code:java}
2022-03-24-13-50.log: I0324 13:52:43.108821 9754 query-driver.cc:253] Retrying query 6d49b7f0544e71cc:37d2486300000000 with new query id dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-50.log: I0324 13:52:43.108845 9754 impala-server.cc:1352] Registered query query_id=dc48f399b63dd87c:43f31fd900000000 session_id=274cff1ef21d8454:80abbc8346fdc7b1
2022-03-24-13-50.log: I0324 13:52:43.109871 9754 impala-server.cc:1392] Query dc48f399b63dd87c:43f31fd900000000 has idle timeout of 10m
2022-03-24-13-50.log: I0324 13:52:43.109925 9754 query-driver.cc:360] Retried query 6d49b7f0544e71cc:37d2486300000000 with new query id dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-50.log: I0324 13:52:43.138975 9756 admission-controller.cc:1879] Trying to admit id=dc48f399b63dd87c:43f31fd900000000 in pool_name=root.default executor_group_name=root.default-group-000 per_host_mem_estimate=85.28 GB dedicated_coord_mem_estimate=376.55 MB max_requests=-1 max_queued=200 max_mem=48828.12 GB
{code}
The query was admitted:
{code:java}
2022-03-24-13-55.log: I0324 13:58:30.230749 9756 admission-controller.cc:1378] Admitted queued query id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log: I0324 13:58:30.243191 9756 coordinator.cc:150] Exec() query_id=dc48f399b63dd87c:43f31fd900000000 stmt=with data_set as (
{code}
The query started executing on 5 backends:
{code:java}
2022-03-24-13-55.log: I0324 13:58:30.252866 9756 coordinator.cc:475] starting execution on 5 backends for query_id=dc48f399b63dd87c:43f31fd900000000
{code}
The Exec() RPC failed for one backend due to a network issue:
{code:java}
2022-03-24-13-55.log: I0324 13:58:30.253540 147 control-service.cc:148] dc48f399b63dd87c:43f31fd900000000] ExecQueryFInstances(): query_id=dc48f399b63dd87c:43f31fd900000000 coord=coordinator-0:27000 #instances=1
2022-03-24-13-55.log: I0324 13:58:30.254931 10438 query-state.cc:948] dc48f399b63dd87c:43f31fd900000000] Executing instance. instance_id=dc48f399b63dd87c:43f31fd900000000 fragment_idx=0 per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=5
W0324 13:58:30.256196 10435 executor-group.cc:164] Executor group root.default-group-002 is unhealthy: 3 out of 4 are available.
2022-03-24-13-55.log: I0324 13:58:30.274003 131 kudu-status-util.h:55] Exec() rpc failed: Network error: Client connection negotiation failed: client connection to 10.143.1.130:27010: connect: Connection refused (error 111)
E0324 13:58:30.274055 131 coordinator-backend-state.cc:190] ExecQueryFInstances rpc query_id=dc48f399b63dd87c:43f31fd900000000 failed: Exec() rpc failed: Network error: Client connection negotiation failed: client connection to 10.143.1.130:27010: connect: Connection refused (error 111)
{code}
Since one of the backends failed to start up, the coordinator canceled execution on the other backends by calling Coordinator::CancelBackends(). Coordinator::CancelBackends() should call Coordinator::BackendState::Cancel() for all 5 backends and then write the log message "CancelBackends() query_id=dc48f399b63dd87c:43f31fd900000000, tried to cancel 5 backends". But I could not find this message in the log file, and could only find 4 log messages written by Coordinator::BackendState::Cancel(). It looks like the thread hung in Coordinator::BackendState::Cancel() on the 4th backend (see the simplified cancellation sketch after the log excerpts below).
{code:java}
2022-03-24-13-55.log: I0324 13:58:30.281944 9756 coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 target backend=10.143.10.147:27000: Sending CancelQueryFInstances rpc
2022-03-24-13-55.log: I0324 13:58:30.282232 147 control-service.cc:219] CancelQueryFInstances(): query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log: I0324 13:58:30.282265 147 query-exec-mgr.cc:126] QueryState: query_id=dc48f399b63dd87c:43f31fd900000000 refcnt=4
2022-03-24-13-55.log: I0324 13:58:30.282274 147 query-state.cc:974] Cancel: query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log: I0324 13:58:30.282284 147 krpc-data-stream-mgr.cc:339] cancelling active streams for query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log: I0324 13:58:30.282481 9756 coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 target backend=10.143.0.249:27010: Sending CancelQueryFInstances rpc
2022-03-24-13-55.log: I0324 13:58:30.296829 534 admission-controller.cc:1887] Stats: agg_num_running=6, agg_num_queued=4, agg_mem_reserved=433.15 GB, local_host(local_mem_admitted=481.94 GB, num_admitted_running=6, num_queued=4, backend_mem_reserved=2.14 GB, topN_query_stats: queries=[a8450959eb292a8e:df40af3200000000, 904fd49f800e6964:5aa6fc5200000000, 1e42ce4cc6ddb020:46960fab00000000, dc48f399b63dd87c:43f31fd900000000, 8f438f867f55c89b:efdb7c1600000000], total_mem_consumed=49.40 MB, fraction_of_pool_total_mem=1; pool_level_stats: num_running=21, min=0, max=25.27 MB, pool_total_mem=49.40 MB, average_per_query=2.35 MB)
2022-03-24-13-55.log: I0324 13:58:30.297036 534 admission-controller.cc:1887] Stats: agg_num_running=6, agg_num_queued=4, agg_mem_reserved=433.15 GB, local_host(local_mem_admitted=481.94 GB, num_admitted_running=6, num_queued=4, backend_mem_reserved=2.14 GB, topN_query_stats: queries=[a8450959eb292a8e:df40af3200000000, 904fd49f800e6964:5aa6fc5200000000, 1e42ce4cc6ddb020:46960fab00000000, dc48f399b63dd87c:43f31fd900000000, 8f438f867f55c89b:efdb7c1600000000], total_mem_consumed=49.40 MB, fraction_of_pool_total_mem=1; pool_level_stats: num_running=21, min=0, max=25.27 MB, pool_total_mem=49.40 MB, average_per_query=2.35 MB)
2022-03-24-13-55.log: I0324 13:58:30.297185 534 admission-controller.cc:1887] Stats: agg_num_running=6, agg_num_queued=4, agg_mem_reserved=433.15 GB, local_host(local_mem_admitted=481.94 GB, num_admitted_running=6, num_queued=4, backend_mem_reserved=2.14 GB, topN_query_stats: queries=[a8450959eb292a8e:df40af3200000000, 904fd49f800e6964:5aa6fc5200000000, 1e42ce4cc6ddb020:46960fab00000000, dc48f399b63dd87c:43f31fd900000000, 8f438f867f55c89b:efdb7c1600000000], total_mem_consumed=49.40 MB, fraction_of_pool_total_mem=1; pool_level_stats: num_running=21, min=0, max=25.27 MB, pool_total_mem=49.40 MB, average_per_query=2.35 MB)
2022-03-24-13-55.log: I0324 13:58:30.303017 10438 krpc-data-stream-mgr.cc:308] dc48f399b63dd87c:43f31fd900000000] DeregisterRecvr(): fragment_instance_id=dc48f399b63dd87c:43f31fd900000000, node=74
2022-03-24-13-55.log: I0324 13:58:30.303082 10438 query-state.cc:957] dc48f399b63dd87c:43f31fd900000000] Instance completed. instance_id=dc48f399b63dd87c:43f31fd900000000 #in-flight=4 status=CANCELLED: Cancelled
2022-03-24-13-55.log: I0324 13:58:30.303110 10437 query-state.cc:468] dc48f399b63dd87c:43f31fd900000000] UpdateBackendExecState(): last report for dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log: I0324 13:58:30.304203 10437 query-state.cc:738] dc48f399b63dd87c:43f31fd900000000] Cancelling fragment instances as directed by the coordinator. Returned status: Cancelled
2022-03-24-13-55.log: I0324 13:58:30.304234 10437 query-state.cc:974] dc48f399b63dd87c:43f31fd900000000] Cancel: query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log: I0324 13:58:32.535372 9756 coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 target backend=10.143.1.130:27010: Not cancelling because the backend is already done: ExecQueryFInstances rpc query_id=dc48f399b63dd87c:43f31fd900000000 failed: Exec() rpc failed: Network error: Client connection negotiation failed: client connection to 10.143.1.130:27010: connect: Connection refused (error 111)
2022-03-24-13-55.log: I0324 13:58:32.535439 9756 coordinator-backend-state.cc:974] query_id=dc48f399b63dd87c:43f31fd900000000 target backend=10.143.1.137:27010: Attempting to cancel Exec() rpc
{code}
The Impala server tried to cancel query dc48f399b63dd87c:43f31fd900000000 because failed backends were detected:
{code:java}
2022-03-24-13-55.log: I0324 13:58:39.793668 554 impala-server.cc:2274] Backends failed for query dc48f399b63dd87c:43f31fd900000000, adding to queue to check for cancellation: impala-executor-002-2:27010, impala-executor-002-3:27010
{code}
Since dc48f399b63dd87c:43f31fd900000000 was itself a retried query, the retry was skipped:
{code:java}
2022-03-24-13-55.log: I0324 13:58:39.793761 538 query-driver.cc:150] Skipping retry of query_id=dc48f399b63dd87c:43f31fd900000000 because it has already been retried
{code}
But the cancellation failed due to an invalid query handle:
{code:java}
2022-03-24-13-55.log: I0324 13:58:39.793794 538 impala-server.cc:1909] CancelFromThreadPool(): cancelling query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-13-55.log: I0324 13:59:11.101034 554 impala-server.cc:2274] Backends failed for query dc48f399b63dd87c:43f31fd900000000, adding to queue to check for cancellation: impala-executor-002-1:27010
2022-03-24-13-55.log: I0324 13:59:11.101111 539 impala-server.cc:1599] Invalid or unknown query handle: dc48f399b63dd87c:43f31fd900000000.
2022-03-24-13-55.log: I0324 13:59:11.101135 539 impala-server.cc:1855] CancelFromThreadPool(): query dc48f399b63dd87c:43f31fd900000000 already unregistered.
2022-03-24-14-00.log: I0324 14:04:18.176677 11006 impala-server.cc:1436] UnregisterQuery(): query_id=dc48f399b63dd87c:43f31fd900000000
2022-03-24-14-00.log: I0324 14:04:18.176694 11006 impala-server.cc:1586] Invalid or unknown query handle: dc48f399b63dd87c:43f31fd900000000.
{code}
For this query failure, the query-retry thread hung in Coordinator::BackendState::Cancel() on the 4th backend while the coordinator tried to cancel the retried query. The last message written by that thread is "Attempting to cancel Exec() rpc", which indicates that WaitOnExecLocked() left the thread waiting indefinitely. We can add a timeout to WaitOnExecLocked() to avoid this hang.
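To make the failure mode concrete, here is a minimal, self-contained sketch of the cancellation path described above, reconstructed from the quoted log messages. The class, member names, and logging are simplifications, not the verbatim Impala source:
{code:cpp}
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical, simplified model of Coordinator::BackendState.
class BackendState {
 public:
  explicit BackendState(std::string address) : address_(std::move(address)) {}

  // Invoked from the ExecQueryFInstances completion callback
  // (ExecCompleteCb()). In this bug it was never invoked for one backend.
  void MarkExecDone() {
    std::lock_guard<std::mutex> l(lock_);
    exec_done_ = true;
    exec_done_cv_.notify_all();
  }

  // Cancel() first waits for the in-flight Exec() rpc to complete. If
  // ExecCompleteCb() never fires, this wait never returns -- the hang.
  bool Cancel() {
    std::unique_lock<std::mutex> l(lock_);
    std::cout << "target backend=" << address_
              << ": Attempting to cancel Exec() rpc\n";
    while (!exec_done_) exec_done_cv_.wait(l);  // waits indefinitely
    // ... send the CancelQueryFInstances rpc ...
    return true;
  }

 private:
  std::string address_;
  std::mutex lock_;
  std::condition_variable exec_done_cv_;
  bool exec_done_ = false;
};

// Simplified sketch of Coordinator::CancelBackends(). Because Cancel() is
// called synchronously per backend, one stuck Cancel() stalls the loop, so
// the "tried to cancel N backends" summary is never logged -- matching the 4
// Cancel messages and the missing summary message in the logs above.
void CancelBackends(const std::string& query_id,
                    std::vector<BackendState*>& backend_states) {
  int num_cancelled = 0;
  for (BackendState* backend_state : backend_states) {
    if (backend_state->Cancel()) ++num_cancelled;
  }
  std::cout << "CancelBackends() query_id=" << query_id
            << ", tried to cancel " << num_cancelled << " backends\n";
}
{code}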
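And a minimal sketch of the proposed fix: bound the wait so Cancel() can give up on an Exec() rpc whose completion callback never fires. std::mutex/std::condition_variable keep the example self-contained; the real change would use Impala's own synchronization primitives and a configurable timeout rather than a hard-coded one:
{code:cpp}
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical timed variant of WaitOnExecLocked(); 'l' must hold the lock
// guarding exec_done. Returns true if the Exec() rpc completed, false on
// timeout, letting Cancel() log a warning and finish instead of hanging.
bool WaitOnExecLockedWithTimeout(std::unique_lock<std::mutex>& l,
                                 std::condition_variable& exec_done_cv,
                                 const bool& exec_done,
                                 std::chrono::milliseconds timeout) {
  return exec_done_cv.wait_for(l, timeout, [&] { return exec_done; });
}
{code}
On timeout, the caller can treat the backend as done, so CancelBackends() moves on to the remaining backends and the query can be unregistered cleanly instead of leaving the query-retry thread blocked forever.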
> Coordinator hang when cancelling a query
> ----------------------------------------
>
>                 Key: IMPALA-11263
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11263
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>
> In a rare case, the callback function Coordinator::BackendState::ExecCompleteCb()
> was somehow not called for the corresponding ExecQueryFInstances RPC. This
> caused the coordinator to wait indefinitely when cancelling the query.