Michael Ho created IMPALA-7464:
----------------------------------

             Summary: DCHECK(!released_exec_resources_) is triggered when 
ExecFInstance() RPC times out
                 Key: IMPALA-7464
                 URL: https://issues.apache.org/jira/browse/IMPALA-7464
             Project: IMPALA
          Issue Type: Bug
          Components: Distributed Exec
    Affects Versions: Impala 3.1.0
            Reporter: Michael Ho
            Assignee: Dan Hecht


Hit the following when running custom_cluster_test/test_rpc_timeout.py

{noformat}
F0817 20:04:25.451669   311 query-state.cc:469] Check failed: 
!released_exec_resources_

void QueryState::AcquireExecResourceRefcount() {
  DCHECK(!released_exec_resources_); <<----
  exec_resource_refcnt_.Add(1);
}

#0  0x00007ff07d59a1f7 in raise () from /lib64/libc.so.6
#1  0x00007ff07d59b8e8 in abort () from /lib64/libc.so.6
#2  0x00000000043dd1b4 in google::DumpStackTraceAndExit() ()
#3  0x00000000043d3c0d in google::LogMessage::Fail() ()
#4  0x00000000043d54b2 in google::LogMessage::SendToLog() ()
#5  0x00000000043d35e7 in google::LogMessage::Flush() ()
#6  0x00000000043d6bae in google::LogMessageFatal::~LogMessageFatal() ()
#7  0x0000000001e2f623 in impala::QueryState::AcquireExecResourceRefcount 
(this=0xc5c2800) at 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-state.cc:469
#8  0x0000000001e2b734 in impala::QueryState::Init (this=0xc5c2800, 
rpc_params=...) at 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-state.cc:125
#9  0x0000000001e516e1 in impala::QueryExecMgr::StartQuery (this=0xc7c55c0, 
params=...) at 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-exec-mgr.cc:52
#10 0x0000000001f6bd12 in impala::ImpalaInternalService::ExecQueryFInstances 
(this=0xd4421e0, return_val=..., params=...) at 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/service/impala-internal-service.cc:51
#11 0x000000000308615c in 
impala::ImpalaInternalServiceProcessor::process_ExecQueryFInstances 
(this=0xcc38580, seqid=0, iprot=0xd3a4900, oprot=0xd3a4980, 
callContext=0xc2475c0) at 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/generated-sources/gen-cpp/ImpalaInternalService.cpp:1278
#12 0x0000000003085eaa in impala::ImpalaInternalServiceProcessor::dispatchCall 
(this=0xcc38580, iprot=0xd3a4900, oprot=0xd3a4980, fname=..., seqid=0, 
callContext=0xc2475c0) at 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/generated-sources/gen-cpp/ImpalaInternalService.cpp:1251
#13 0x0000000001913fd6 in apache::thrift::TDispatchProcessor::process 
(this=0xcc38580, in=..., out=..., connectionContext=0xc2475c0) at 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/thrift-0.9.3-p4/include/thrift/TDispatchProcessor.h:121
#14 0x0000000001d2fbe9 in apache::thrift::server::TAcceptQueueServer::Task::run 
(this=0xce010c0) at 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/rpc/TAcceptQueueServer.cpp:68
#15 0x0000000001d27ca5 in impala::ThriftThread::RunRunnable (this=0xc246500, 
runnable=..., promise=0x7fefee7b1260) at 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/rpc/thrift-thread.cc:74
#16 0x0000000001d293cb in boost::_mfi::mf2<void, impala::ThriftThread, 
boost::shared_ptr<apache::thrift::concurrency::Runnable>, 
impala::Promise<unsigned long, (impala::PromiseMode)0>*>::operator() 
(this=0xcdf1650, p=0xc246500, a1=..., a2=0x7fefee7b1260) at 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/mem_fn_template.hpp:280
#17 0x0000000001d29261 in 
boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, 
boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, 
boost::_bi::value<impala::Promise<unsigned long, (impala::PromiseMode)0>*> 
>::operator()<boost::_mfi::mf2<void, impala::ThriftThread, 
boost::shared_ptr<apache::thrift::concurrency::Runnable>, 
impala::Promise<unsigned long, (impala::PromiseMode)0>*>, boost::_bi::list0> 
(this=0xcdf1660, f=..., a=...) at 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind.hpp:392
#18 0x0000000001d28fad in boost::_bi::bind_t<void, boost::_mfi::mf2<void, 
impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, 
impala::Promise<unsigned long, (impala::PromiseMode)0>*>, 
boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, 
boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, 
boost::_bi::value<impala::Promise<unsigned long, (impala::PromiseMode)0>*> > 
>::operator() (this=0xcdf1650) at 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind_template.hpp:20
#19 0x0000000001d28ec0 in 
boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, 
boost::_mfi::mf2<void, impala::ThriftThread, 
boost::shared_ptr<apache::thrift::concurrency::Runnable>, 
impala::Promise<unsigned long, (impala::PromiseMode)0>*>, 
boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, 
boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, 
boost::_bi::value<impala::Promise<unsigned long, (impala::PromiseMode)0>*> > >, 
void>::invoke (function_obj_ptr=...) at 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/function/function_template.hpp:153
#20 0x0000000001c4cbd0 in boost::function0<void>::operator() 
(this=0x7fefe719cba0) at 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/function/function_template.hpp:767
#21 0x00000000020754db in impala::Thread::SuperviseThread(std::string const&, 
std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, 
impala::Promise<long, (impala::PromiseMode)0>*) (name=..., category=..., 
functor=..., parent_thread_info=0x7fefee7b1850, thread_started=0x7fefee7b0f30) 
at 
/data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/thread.cc:359
#22 0x000000000207d7fb in boost::_bi::list5<boost::_bi::value<std::string>, 
boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, 
boost::_bi::value<impala::ThreadDebugInfo*>, 
boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> 
>::operator()<void (*)(std::string const&, std::string const&, 
boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, 
(impala::PromiseMode)0>*), boost::_bi::list0>(boost::_bi::type<void>, void 
(*&)(std::string const&, std::string const&, boost::function<void ()>, 
impala::ThreadDebugInfo const*, impala::Promise<long, 
(impala::PromiseMode)0>*), boost::_bi::list0&, int) (this=0xd016fc0, 
f=@0xd016fb8: 0x2075174 <impala::Thread::SuperviseThread(std::string const&, 
std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, 
impala::Promise<long, (impala::PromiseMode)0>*)>, a=...) at 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind.hpp:525
#23 0x000000000207d71f in boost::_bi::bind_t<void, void (*)(std::string const&, 
std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, 
impala::Promise<long, (impala::PromiseMode)0>*), 
boost::_bi::list5<boost::_bi::value<std::string>, 
boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, 
boost::_bi::value<impala::ThreadDebugInfo*>, 
boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > 
>::operator()() (this=0xd016fb8) at 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind_template.hpp:20
#24 0x000000000207d6e2 in boost::detail::thread_data<boost::_bi::bind_t<void, 
void (*)(std::string const&, std::string const&, boost::function<void ()>, 
impala::ThreadDebugInfo const*, impala::Promise<long, 
(impala::PromiseMode)0>*), boost::_bi::list5<boost::_bi::value<std::string>, 
boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, 
boost::_bi::value<impala::ThreadDebugInfo*>, 
boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > > >::run() 
(this=0xd016e00) at 
/data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/thread/detail/thread.hpp:116
#25 0x00000000032ff25a in thread_proxy ()
#26 0x00007ff07d92fe25 in start_thread () from /lib64/libpthread.so.0
#27 0x00007ff07d65d34d in clone () from /lib64/libc.so.6
{noformat}

It appears that there is a race on the coordinator which occurs only when 
{{ExecFInstance()}} RPC times out due to really slow processing of the RPC on 
the coordinator itself. The assumption of the existing code is that by the time 
{{exec_rpcs_complete_barrier_}} is unblocked, all executors should have 
processed their RPCs and called QueryState::Init(). This assumption is violated 
when the executor on coordinator is slow enough to cause {{ExecFInstance()}} 
RPC to time out.

In which case, the coordinator proceeds to mark the query as failed in 
{{Coordinator::FinishBackendStartup()}} which calls {{UpdateExecState()}} which 
eventually calls {{ReleaseExecResources()}} in {{HandleExecStateTransition()}}. 
That's why the DCHECK above is triggered.


{noformat}
I0817 20:04:23.451918   309 coordinator.cc:508] ExecState: query 
id=d740459b753125a4:f74e0fd700000000 finstance=N/A on 
host=impala-ec2-centos74-m5-4xlarge-ondemand-1ebb.vpc.cloudera.com (EXECUTING 
-> ERROR) status=ExecQueryFInstances rpc 
query_id=d740459b753125a4:f74e0fd700000000 failed: RPC recv timed out: dest 
address: impala-ec2-centos74-m5-4xlarge-ondemand-1ebb.vpc.cloudera.com:22000, 
rpc: N6impala26TExecQueryFInstancesResultE
{noformat}

In release builds, the behavior would be that the executor on coordinator will 
move ahead to allocate a bunch of resources after {{ReleaseExecResources()}} 
has been called on the coordinator. It will also start a bunch of fragment 
instances only to realize latter from {{ReportExecStatus()}} RPC that the query 
should be cancelled, in which case, it calls {{ReleaseExecResource()}} again 
after the ref count reaches 0 in {{ReleaseExecResourceRefCount()}} so there 
should be no resource leak. That said, if any logic relies on the flag 
{{released_exec_resources_}}, it may get confused.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to