[ https://issues.apache.org/jira/browse/IMPALA-6788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518547#comment-16518547 ]
Dan Hecht commented on IMPALA-6788:
-----------------------------------

This has turned out to be a bit more complicated than hoped without KRPC. It's possible to do, but it requires moving the {{HandleExecState()}} code path to an async thread to avoid deadlocking the exec RPC threads (since {{HandleExecState()}} can only happen once all exec RPCs are either completed or cancelled). Let's revisit this after converting the Coordinator control RPCs to KRPC, since having async RPCs (which means we may not need the exec RPC thread pool) and cancellable RPCs will enable different solutions to this problem.

> Query fragments can spend lots of time starting up then fail right after "starting" all backends
> ------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-6788
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6788
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Distributed Exec
>    Affects Versions: Impala 2.12.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Dan Hecht
>            Priority: Major
>              Labels: krpc, rpc
>         Attachments: connect_thread_busy_queries_failing.txt, impalad.va1007.foo.com.impala.log.INFO.20180401-200453.1800807.zip
>
>
> Logs from a large cluster show that query startup can take a long time; once startup completes, the query is cancelled because one of the intermediate RPCs failed.
> It is not clear what the right answer is, as fragments are started asynchronously. Possibly a timeout?
> {code}
> I0401 21:25:30.776803 1830900 coordinator.cc:99] Exec() query_id=334cc7dd9758c36c:ec38aeb400000000 stmt=with customer_total_return as
> I0401 21:25:30.813993 1830900 coordinator.cc:357] starting execution on 644 backends for query_id=334cc7dd9758c36c:ec38aeb400000000
> I0401 21:29:58.406466 1830900 coordinator.cc:370] started execution on 644 backends for query_id=334cc7dd9758c36c:ec38aeb400000000
> I0401 21:29:58.412132 1830900 coordinator.cc:896] Cancel() query_id=334cc7dd9758c36c:ec38aeb400000000
> I0401 21:29:59.188817 1830900 coordinator.cc:906] CancelBackends() query_id=334cc7dd9758c36c:ec38aeb400000000, tried to cancel 643 backends
> I0401 21:29:59.189177 1830900 coordinator.cc:1092] Release admission control resources for query_id=334cc7dd9758c36c:ec38aeb400000000
> {code}
> {code}
> I0401 21:23:48.218379 1830386 coordinator.cc:99] Exec() query_id=e44d553b04d47cfb:28f06bb800000000 stmt=with customer_total_return as
> I0401 21:23:48.270226 1830386 coordinator.cc:357] starting execution on 640 backends for query_id=e44d553b04d47cfb:28f06bb800000000
> I0401 21:29:58.402195 1830386 coordinator.cc:370] started execution on 640 backends for query_id=e44d553b04d47cfb:28f06bb800000000
> I0401 21:29:58.403818 1830386 coordinator.cc:896] Cancel() query_id=e44d553b04d47cfb:28f06bb800000000
> I0401 21:29:59.255903 1830386 coordinator.cc:906] CancelBackends() query_id=e44d553b04d47cfb:28f06bb800000000, tried to cancel 639 backends
> I0401 21:29:59.256251 1830386 coordinator.cc:1092] Release admission control resources for query_id=e44d553b04d47cfb:28f06bb800000000
> {code}
> Checked the coordinator and threads appear to be spending lots of time waiting on exec_complete_barrier_
> {code}
> #0 0x00007fd928c816d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x0000000001222944 in impala::Promise<bool>::Get() ()
> #2 0x0000000001220d7b in impala::Coordinator::StartBackendExec() ()
> #3 0x0000000001221c87 in impala::Coordinator::Exec() ()
> #4 0x0000000000c3a925 in impala::ClientRequestState::ExecQueryOrDmlRequest(impala::TQueryExecRequest const&) ()
> #5 0x0000000000c41f7e in impala::ClientRequestState::Exec(impala::TExecRequest*) ()
> #6 0x0000000000bff597 in impala::ImpalaServer::ExecuteInternal(impala::TQueryCtx const&, std::shared_ptr<impala::ImpalaServer::SessionState>, bool*, std::shared_ptr<impala::ClientRequestState>*) ()
> #7 0x0000000000c061d9 in impala::ImpalaServer::Execute(impala::TQueryCtx*, std::shared_ptr<impala::ImpalaServer::SessionState>, std::shared_ptr<impala::ClientRequestState>*) ()
> #8 0x0000000000c561c5 in impala::ImpalaServer::query(beeswax::QueryHandle&, beeswax::Query const&) ()
> /StartBackendExec
> #11 0x0000000000d60c9a in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long>*), boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long>*> > > >::run() ()
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
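
For illustration only: the stack trace shows the coordinator thread blocked until every exec RPC has reported in, and the description floats a timeout as one possible answer. A minimal sketch of a barrier that releases its waiter on the first failed RPC or after a deadline might look like the following. {{FailFastBarrier}}, {{Notify()}} and {{WaitFor()}} are hypothetical names chosen for this example and are not Impala's actual classes or API. The sketch also covers only the waiting side; it does not address the {{HandleExecState()}} deadlock concern raised in the comment.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper for illustration; NOT Impala's barrier implementation.
// Tracks a fixed number of outstanding exec RPCs and wakes the waiter as soon
// as all of them finish OR any one of them reports a failure.
class FailFastBarrier {
 public:
  explicit FailFastBarrier(int pending) : pending_(pending) {}

  // Called once per backend when its exec RPC completes (ok == success).
  void Notify(bool ok) {
    std::lock_guard<std::mutex> l(mu_);
    if (!ok) failed_ = true;
    --pending_;
    // Wake the waiter early on the first failure, not only on the last RPC.
    if (failed_ || pending_ == 0) cv_.notify_all();
  }

  // Blocks until every RPC has finished, one has failed, or 'timeout' expires.
  // Returns true only if all RPCs completed successfully within the timeout.
  bool WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait_for(l, timeout, [this] { return failed_ || pending_ == 0; });
    return !failed_ && pending_ == 0;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int pending_;          // RPCs that have not reported yet
  bool failed_ = false;  // set once any RPC reports failure
};
```

With a barrier of this shape, a caller that gets {{false}} back could start cancelling immediately instead of waiting minutes for the remaining backends. Whether that is safe depends on the RPCs being cancellable, which is the point of revisiting this after the KRPC conversion.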