[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808278#comment-16808278 ]
Joe McDonnell commented on IMPALA-8322:
---------------------------------------

Ok, continuing below that level. The thread in ThreadResourceMgr::DestroyPool() is in HdfsScanNode::ThreadTokenAvailableCb(), waiting on HdfsScanNode's lock_. It looks like the thread holding that lock is in HdfsScanNode::SetError():
{noformat}
 0 impalad!sys_futex [linux_syscall_support.h : 2713 + 0x24]
 1 impalad!base::internal::SpinLockDelay(int volatile*, int, int) [spinlock_linux-inl.h : 88 + 0x19]
 2 impalad!base::SpinLock::SlowLock() [spinlock.cc : 135 + 0x18]
 3 impalad!base::SpinLock::Lock() [spinlock.h : 75 + 0xc]
 4 impalad!impala::SpinLock::lock() [spinlock.h : 34 + 0xc]
 5 impalad!boost::unique_lock<impala::SpinLock>::lock() [lock_types.hpp : 346 + 0xf]
 6 impalad!boost::unique_lock<impala::SpinLock>::unique_lock(impala::SpinLock&) [lock_types.hpp : 124 + 0xc]
 7 impalad!impala::io::ScanRange::CancelInternal(impala::Status const&, bool) [scan-range.cc : 353 + 0x32]
 8 impalad!impala::io::RequestContext::Cancel() [request-context.cc : 280 + 0x1b]
 9 impalad!impala::HdfsScanNode::SetDoneInternal(impala::Status const&) [hdfs-scan-node.cc : 554 + 0x1a]
10 impalad!impala::HdfsScanNode::SetError(impala::Status const&) [hdfs-scan-node.cc : 566 + 0x13]
11 impalad!impala::HdfsScanNode::ProcessSplit(std::vector<impala::FilterContext, std::allocator<impala::FilterContext> > const&, impala::MemPool*, impala::io::ScanRange*, long*) [hdfs-scan-node.cc : 538 + 0x19]
12 impalad!impala::HdfsScanNode::ScannerThread(bool, long) [hdfs-scan-node.cc : 418 + 0x2a]
13 impalad!impala::HdfsScanNode::ThreadTokenAvailableCb(impala::ThreadResourcePool*)::{lambda()#1}::operator()() const + 0x30
...
{noformat}
That thread is waiting in ScanRange::CancelInternal(), trying to lock the HdfsFileReader's lock_. HdfsFileReader holds this lock during Open(), so at that point we are waiting on opening an S3 file handle. This could take a while, and it also explains why this only shows up on S3. A minimal sketch of this lock chain is included after the quoted issue text below.

> S3 tests encounter "timed out waiting for receiver fragment instance"
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-8322
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8322
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.3.0
>            Reporter: Joe McDonnell
>            Priority: Blocker
>              Labels: broken-build
>         Attachments: fb5b9729-2d7a-4590-ea365b87-d2ead75e.dmp_dumped,
>                      run_tests_swimlane.json.gz
>
>
> This has been seen multiple times when running s3 tests:
> {noformat}
> query_test/test_join_queries.py:57: in test_basic_joins
>     self.run_test_case('QueryTest/joins', new_vector)
> common/impala_test_suite.py:472: in run_test_case
>     result = self.__execute_query(target_impalad_client, query, user=user)
> common/impala_test_suite.py:699: in __execute_query
>     return impalad_client.execute(query, user=user)
> common/impala_connection.py:174: in execute
>     return self.__beeswax_client.execute(sql_stmt, user=user)
> beeswax/impala_beeswax.py:183: in execute
>     handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:360: in __execute_query
>     self.wait_for_finished(handle)
> beeswax/impala_beeswax.py:381: in wait_for_finished
>     raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> E   Query aborted:Sender 127.0.0.1 timed out waiting for receiver fragment instance: 6c40d992bb87af2f:0ce96e5d00000007, dest node: 4{noformat}
> This is related to IMPALA-6818.
> On a bad run, there are various timeouts in the impalad logs:
> {noformat}
> I0316 10:47:16.359313 20175 krpc-data-stream-mgr.cc:354] Sender 127.0.0.1 timed out waiting for receiver fragment instance: ef4a5dc32a6565bd:a8720b8500000007, dest node: 5
> I0316 10:47:16.359345 20175 rpcz_store.cc:265] Call impala.DataStreamService.TransmitData from 127.0.0.1:40030 (request call id 14881) took 120182ms. Request Metrics: {}
> I0316 10:47:16.359380 20175 krpc-data-stream-mgr.cc:354] Sender 127.0.0.1 timed out waiting for receiver fragment instance: d148d83e11a4603d:54dc35f700000004, dest node: 3
> I0316 10:47:16.359395 20175 rpcz_store.cc:265] Call impala.DataStreamService.TransmitData from 127.0.0.1:40030 (request call id 14880) took 123097ms. Request Metrics: {}
> ... various messages ...
> I0316 10:47:56.364990 20154 kudu-util.h:108] Cancel() RPC failed: Timed out: CancelQueryFInstances RPC to 127.0.0.1:27000 timed out after 10.000s (SENT)
> ... various messages ...
> W0316 10:48:15.056421 20150 rpcz_store.cc:251] Call impala.ControlService.CancelQueryFInstances from 127.0.0.1:40912 (request call id 202) took 48695ms (client timeout 10000).
> W0316 10:48:15.056473 20150 rpcz_store.cc:255] Trace:
> 0316 10:47:26.361265 (+ 0us) impala-service-pool.cc:165] Inserting onto call queue
> 0316 10:47:26.361285 (+ 20us) impala-service-pool.cc:245] Handling call
> 0316 10:48:15.056398 (+48695113us) inbound_call.cc:162] Queueing success response
> Metrics: {}
> I0316 10:48:15.057087 20139 connection.cc:584] Got response to call id 202 after client already timed out or cancelled{noformat}
> So far, this has only happened on s3. The system load at the time is not higher than normal. If anything it is lower than normal.
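To make the lock chain in the comment above concrete, here is a minimal, standalone C++ sketch. It is not Impala code: ScanNode, FileReader, and the two-second sleep are hypothetical stand-ins for HdfsScanNode, HdfsFileReader, and a slow S3 HdfsFileReader::Open(). It only illustrates the shape of the stall: the reader lock is held across a slow Open(), which blocks the cancel path inside SetError(), which in turn holds the scan node lock that ThreadTokenAvailableCb() (the thread DestroyPool() ends up waiting on) needs.
{noformat}
// Standalone sketch, hypothetical names; compile with: g++ -std=c++14 -pthread sketch.cc
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

// Stand-in for HdfsFileReader: the lock is held for the whole duration of Open().
struct FileReader {
  std::mutex lock_;
  void Open() {
    std::lock_guard<std::mutex> l(lock_);
    std::this_thread::sleep_for(2s);  // simulates a slow S3 open while holding lock_
  }
  void Cancel() {
    // Mirrors the wait in ScanRange::CancelInternal(): must wait for Open() to release lock_.
    std::lock_guard<std::mutex> l(lock_);
    std::cout << "reader cancelled\n";
  }
};

// Stand-in for HdfsScanNode: SetError() holds the node lock while cancelling I/O.
struct ScanNode {
  std::mutex lock_;
  FileReader reader_;
  void SetError() {
    std::lock_guard<std::mutex> l(lock_);  // node lock held...
    reader_.Cancel();                      // ...while blocking on the reader lock
  }
  void ThreadTokenAvailableCb() {
    // Mirrors the callback invoked from ThreadResourceMgr::DestroyPool(): needs the node lock.
    std::lock_guard<std::mutex> l(lock_);
    std::cout << "callback ran\n";
  }
};

int main() {
  ScanNode node;
  std::thread opener([&] { node.reader_.Open(); });   // slow "S3" open
  std::this_thread::sleep_for(100ms);                 // let Open() acquire the reader lock
  std::thread scanner([&] { node.SetError(); });      // scanner thread hits an error
  std::this_thread::sleep_for(100ms);                 // let SetError() acquire the node lock
  // Stalls until the whole chain unwinds, i.e. until the simulated Open() finishes.
  node.ThreadTokenAvailableCb();
  opener.join();
  scanner.join();
  return 0;
}
{noformat}
Run as written, the final ThreadTokenAvailableCb() call cannot return until the simulated open completes: nothing is deadlocked, but everything behind the scan node lock is serialized behind the slow open, which is the same shape as the stall in the attached minidump.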