[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-04-14 Thread ASF subversion and git services (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817454#comment-16817454 ]

ASF subversion and git services commented on IMPALA-8322:
-

Commit 8ec17b7cdffbd82ce7b3e652edc2530df083eeab in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8ec17b7 ]

IMPALA-8322: Add periodic dirty check of done_ in ThreadTokenAvailableCb

When HdfsScanNode is cancelled or hits an error, SetDoneInternal() holds
HdfsScanNode::lock_ while it runs RequestContext::Cancel(), which can
wait on IO threads to complete outstanding IOs. This can create a cascade
of blocked threads that makes Prepare() take a significant amount of time,
causing datastream sender timeouts.

The specific scenario seen has this set of threads:
Thread 1: A DiskIoMgr thread is blocked on IO in hdfsOpenFile() or
  hdfsRead(), holding HdfsFileReader::lock_.
Thread 2: An HDFS scanner thread is blocked in
  HdfsScanNode::SetDoneInternal() -> RequestContext::Cancel()
  -> ScanRange::CancelInternal(), waiting on HdfsFileReader::lock_.
  It is holding HdfsScanNode::lock_.
Thread 3: A thread in ThreadResourceMgr::DestroyPool() -> (a few layers)
  -> HdfsScanNode::ThreadTokenAvailableCb() is blocked waiting on
  HdfsScanNode::lock_ while holding ThreadResourceMgr::lock_.
Thread 4: A thread in FragmentInstanceState::Prepare()
  -> RuntimeState::Init() -> ThreadResourceMgr::CreatePool() is blocked
  waiting on ThreadResourceMgr::lock_.

When Prepare() takes a significant time, datastream senders will time out
waiting for the datastream receivers to start up. This causes failed
queries. S3 has higher latencies for IO and does not have file handle
caching, so S3 is more susceptible to this issue than other platforms.

This changes HdfsScanNode::ThreadTokenAvailableCb() to periodically do a
dirty check of HdfsScanNode::done_ when waiting to acquire the lock. This
avoids the blocking experienced by Thread 3 in the example above.
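
A minimal, self-contained sketch of the shape of this change, with
std::timed_mutex and std::atomic standing in for Impala's own lock_ and done_
(an illustration of the technique, not the actual patch):

{code}
#include <atomic>
#include <chrono>
#include <mutex>

// Simplified stand-in for HdfsScanNode; only the locking pattern matters.
class ScanNodeSketch {
 public:
  // Callback invoked when a thread token becomes available.
  void ThreadTokenAvailableCb() {
    std::unique_lock<std::timed_mutex> lock(lock_, std::defer_lock);
    // Instead of blocking indefinitely on lock_, retry with a timeout and
    // do a dirty check of done_ between attempts. If the node is already
    // done (cancelled or errored), bail out without ever taking the lock.
    while (!lock.try_lock_for(std::chrono::milliseconds(10))) {
      if (done_.load(std::memory_order_acquire)) return;
    }
    if (done_.load(std::memory_order_acquire)) return;
    // ... start additional scanner threads while holding lock_ ...
  }

  void SetDone() {
    std::lock_guard<std::timed_mutex> lock(lock_);
    done_.store(true, std::memory_order_release);
    // ... cancel outstanding I/O while still holding lock_; on S3 this can
    // block for a long time on in-flight opens/reads ...
  }

 private:
  std::timed_mutex lock_;
  std::atomic<bool> done_{false};
};

int main() {
  ScanNodeSketch node;
  node.SetDone();
  node.ThreadTokenAvailableCb();  // returns quickly via the dirty check
}
{code}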

Testing:
 - Ran tests on normal HDFS and repeatedly on S3

Change-Id: I4881a3e5bfda64e8d60af95ad13b450cf7f8c130
Reviewed-on: http://gerrit.cloudera.org:8080/12968
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> S3 tests encounter "timed out waiting for receiver fragment instance"
> -
>
> Key: IMPALA-8322
> URL: https://issues.apache.org/jira/browse/IMPALA-8322
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 3.3.0
>Reporter: Joe McDonnell
>Priority: Blocker
>  Labels: broken-build
> Attachments: fb5b9729-2d7a-4590-ea365b87-d2ead75e.dmp_dumped, 
> run_tests_swimlane.json.gz
>
>
> This has been seen multiple times when running s3 tests:
> {noformat}
> query_test/test_join_queries.py:57: in test_basic_joins
> self.run_test_case('QueryTest/joins', new_vector)
> common/impala_test_suite.py:472: in run_test_case
> result = self.__execute_query(target_impalad_client, query, user=user)
> common/impala_test_suite.py:699: in __execute_query
> return impalad_client.execute(query, user=user)
> common/impala_connection.py:174: in execute
> return self.__beeswax_client.execute(sql_stmt, user=user)
> beeswax/impala_beeswax.py:183: in execute
> handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:360: in __execute_query
> self.wait_for_finished(handle)
> beeswax/impala_beeswax.py:381: in wait_for_finished
> raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EQuery aborted:Sender 127.0.0.1 timed out waiting for receiver fragment 
> instance: 6c40d992bb87af2f:0ce96e5d0007, dest node: 4{noformat}
> This is related to IMPALA-6818. On a bad run, there are various time outs in 
> the impalad logs:
> {noformat}
> I0316 10:47:16.359313 20175 krpc-data-stream-mgr.cc:354] Sender 127.0.0.1 
> timed out waiting for receiver fragment instance: 
> ef4a5dc32a6565bd:a8720b850007, dest node: 5
> I0316 10:47:16.359345 20175 rpcz_store.cc:265] Call 
> impala.DataStreamService.TransmitData from 127.0.0.1:40030 (request call id 
> 14881) took 120182ms. Request Metrics: {}
> I0316 10:47:16.359380 20175 krpc-data-stream-mgr.cc:354] Sender 127.0.0.1 
> timed out waiting for receiver fragment instance: 
> d148d83e11a4603d:54dc35f70004, dest node: 3
> I0316 10:47:16.359395 20175 rpcz_store.cc:265] Call 
> impala.DataStreamService.TransmitData from 127.0.0.1:40030 (request call id 
> 14880) took 123097ms. Request Metrics: {}
> ... various messages ...
> I0316 10:47:56.364990 20154 kudu-util.h:108] Cancel() RPC failed: Timed out: 
> CancelQueryFInstances RPC to 127.0.0.1:27000 timed out after 10.000s (SENT)
> ... various messages ...
> W0316 10:48:15.056421 20150 rpcz_store.cc:251] Call 
> impala.ControlService.CancelQueryFInstances from 127.0.0.1:40912 (request 
> call id 202) took 48695ms (client timeout 1).
> W0316 10:48:15.056473 20150 rpcz_store.cc:255] Trace:
> 0316 10:47:26.361265 (+ 0us) impala-service-pool.cc:165] Inserting onto call 
> queue
> 0316 10:47:26.361285 (+ 20us) impala-service-pool.cc:245] Handling call
> 0316 10:48:15.056398 (+48695113us) inbound_call.cc:162] Queueing success 
> response
> Metrics: {}
> I0316 10:48:15.057087 20139 connection.cc:584] Got response to call id 202 
> after client already timed out or cancelled{noformat}
> So far, this has only happened on s3. The system load at the time is not 
> higher than normal. If anything it is lower than normal. 

[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-04-05 Thread Joe McDonnell (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811027#comment-16811027 ]

Joe McDonnell commented on IMPALA-8322:
---

One way to fix this is to remove the thread available callback earlier. 
Currently it is done in HdfsScanNode::Close(), but it might be possible to move 
it to SetDoneInternal(). If HdfsScanNode::SetDoneInternal() removes the callback 
before calling RequestContext::Cancel(), that reduces the window where a caller 
in ThreadResourceMgr::DestroyPool() can invoke the callback and get stuck trying 
to acquire the lock. An alternative is for HdfsScanNode::ThreadTokenAvailableCb() 
to do a dirty check on done_ without holding the lock and bail out if it is true.

In addition, it might make sense to look at how RequestContext::Cancel() works 
when there are multiple scan ranges and multiple IO threads. It immediately 
sets state that would prevent IO threads from getting more scan ranges from 
this RequestContext. But several IOs might be in progress. Right now, it does 
this loop:
{code:java}
for (ScanRange* range : active_scan_ranges_) {
  range->CancelInternal(CONTEXT_CANCELLED, false);
}
{code}
Each ScanRange::CancelInternal() needs to acquire a lock on the scan range and 
on the HdfsFileReader. The HdfsFileReader lock is held during opens and reads, 
so this will wait on IO. While the cancel thread is waiting on the first 
HdfsFileReader, another IO thread might be finishing an open on the second 
HdfsFileReader. Right now, that IO thread can proceed to do a read, because it 
doesn't know yet that it is cancelled (it is looking at ScanRange::cancel_status_, 
which has not been set for that range yet). So, after waiting on the first 
HdfsFileReader, the cancel thread might also need to wait on the second 
HdfsFileReader. (There is a limit, because IOs only take so long, and the IO 
threads can't pick up more ranges.)

If we looped over the ScanRanges setting cancel_status_ (or some other state) 
without grabbing the HdfsFileReader lock, the extra read could be avoided, 
because the IO thread would know immediately that it is cancelled (and it could 
move on to scan ranges from other queries). Then, the cancel thread could make a 
second pass to wait for the IO threads to finish.
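
A self-contained sketch of that two-phase idea; the class and member names
below are illustrative stand-ins, not Impala's actual ScanRange/RequestContext
API:

{code}
#include <atomic>
#include <mutex>
#include <utility>
#include <vector>

// Illustrative stand-in for a scan range plus its file reader's lock.
struct ScanRangeSketch {
  std::atomic<bool> cancelled{false};  // cheap flag that IO threads poll
  std::mutex file_reader_lock;         // held across open()/read()
};

class RequestContextSketch {
 public:
  explicit RequestContextSketch(std::vector<ScanRangeSketch*> ranges)
      : active_ranges_(std::move(ranges)) {}

  void Cancel() {
    // Phase 1: flip the flag on every active range without touching any
    // file-reader lock, so IO threads stop issuing further reads as soon
    // as they check the flag.
    for (ScanRangeSketch* range : active_ranges_) {
      range->cancelled.store(true, std::memory_order_release);
    }
    // Phase 2: wait for in-flight IOs to drain by acquiring each
    // file-reader lock in turn.
    for (ScanRangeSketch* range : active_ranges_) {
      std::lock_guard<std::mutex> l(range->file_reader_lock);
    }
  }

 private:
  std::vector<ScanRangeSketch*> active_ranges_;
};

int main() {
  ScanRangeSketch r1, r2;
  RequestContextSketch ctx({&r1, &r2});
  ctx.Cancel();
}
{code}

The only point is the ordering: publish the cancelled state to every range
first, and only then take the slow per-reader locks to wait for in-flight IOs.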


[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-04-04 Thread Joe McDonnell (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810312#comment-16810312 ]

Joe McDonnell commented on IMPALA-8322:
---

The cancellation test is running, with multiple cancels in progress. It is 
likely that ThreadResourceMgr::DestroyPool() is being called frequently.

FYI: [~kwho] [~lv] [~twm378]

 



[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-04-02 Thread Joe McDonnell (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808278#comment-16808278 ]

Joe McDonnell commented on IMPALA-8322:
---

Ok, continuing below that level. The thread in ThreadResourceMgr::DestroyPool() 
is in HdfsScanNode::ThreadTokenAvailableCb(), waiting on HdfsScanNode's lock_. 
It looks like the thread holding that lock is in HdfsScanNode::SetError():
{noformat}
0 impalad!sys_futex [linux_syscall_support.h : 2713 + 0x24]
1 impalad!base::internal::SpinLockDelay(int volatile*, int, int) 
[spinlock_linux-inl.h : 88 + 0x19]
2 impalad!base::SpinLock::SlowLock() [spinlock.cc : 135 + 0x18]
3 impalad!base::SpinLock::Lock() [spinlock.h : 75 + 0xc]
4 impalad!impala::SpinLock::lock() [spinlock.h : 34 + 0xc]
5 impalad!boost::unique_lock::lock() [lock_types.hpp : 346 + 
0xf]
6 impalad!boost::unique_lock::unique_lock(impala::SpinLock&) 
[lock_types.hpp : 124 + 0xc]
7 impalad!impala::io::ScanRange::CancelInternal(impala::Status const&, bool) 
[scan-range.cc : 353 + 0x32]
8 impalad!impala::io::RequestContext::Cancel() [request-context.cc : 280 + 0x1b]
9 impalad!impala::HdfsScanNode::SetDoneInternal(impala::Status const&) 
[hdfs-scan-node.cc : 554 + 0x1a]
10 impalad!impala::HdfsScanNode::SetError(impala::Status const&) 
[hdfs-scan-node.cc : 566 + 0x13]
11 
impalad!impala::HdfsScanNode::ProcessSplit(std::vector > const&, impala::MemPool*, 
impala::io::ScanRange*, long*) [hdfs-scan-node.cc : 538 + 0x19]
12 impalad!impala::HdfsScanNode::ScannerThread(bool, long) [hdfs-scan-node.cc : 
418 + 0x2a]
13 
impalad!impala::HdfsScanNode::ThreadTokenAvailableCb(impala::ThreadResourcePool*)::{lambda()#1}::operator()()
 const + 0x30
...{noformat}
That thread is waiting in ScanRange::CancelInternal(), trying to acquire the 
HdfsFileReader's lock_. HdfsFileReader holds this lock during Open(), so at that 
point we are waiting on an S3 file handle being opened. This could take a while, 
which also explains why this only shows up on S3.
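
A toy reproduction of that chain using plain std::mutex/std::thread (nothing
Impala-specific), just to make the transitive blocking concrete:

{code}
#include <chrono>
#include <mutex>
#include <thread>

std::mutex file_reader_lock;  // analogue of HdfsFileReader::lock_
std::mutex scan_node_lock;    // analogue of HdfsScanNode::lock_

void IoThread() {             // holds the reader lock across a slow S3 open
  std::lock_guard<std::mutex> reader(file_reader_lock);
  std::this_thread::sleep_for(std::chrono::seconds(5));
}

void CancelThread() {         // SetDoneInternal() -> CancelInternal()
  std::lock_guard<std::mutex> node(scan_node_lock);
  std::lock_guard<std::mutex> reader(file_reader_lock);  // waits on IoThread
}

void CallbackThread() {       // ThreadTokenAvailableCb() via DestroyPool()
  std::lock_guard<std::mutex> node(scan_node_lock);      // waits on CancelThread
}

int main() {
  std::thread io(IoThread);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::thread cancel(CancelThread);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::thread cb(CallbackThread);
  io.join(); cancel.join(); cb.join();
}
{code}

Everything further back in the chain (ThreadResourceMgr::DestroyPool(), and
behind it CreatePool() and Prepare()) stays stuck for as long as the slow open
takes.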


[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-04-02 Thread Joe McDonnell (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808271#comment-16808271 ]

Joe McDonnell commented on IMPALA-8322:
---

I created a diagnostic change so that if the coordinator sees the "timed out 
waiting for receiver" error, it sends an extra field on the cancel RPC for that 
query that causes the executor to LOG(FATAL) and produce a minidump. The 
problem reproduced, and now we have a minidump/core for the executor impalad 
that wasn't starting the receiver.
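
Roughly the shape of that diagnostic hook, assuming glog-style logging; the
flag and function names below are made up for illustration, and the real change
wires the extra field through the CancelQueryFInstances RPC:

{code}
#include <glog/logging.h>

// Hypothetical executor-side handler: if the coordinator flagged this cancel
// as "the receiver never showed up", crash on purpose so the crash handler
// writes a minidump/core of the stuck process for offline analysis.
void HandleCancelSketch(bool dump_executor_on_cancel) {
  if (dump_executor_on_cancel) {
    LOG(FATAL) << "Deliberate crash to capture a minidump of the stuck "
               << "executor (diagnostic builds only)";
  }
  // ... normal cancellation path ...
}

int main(int argc, char** argv) {
  google::InitGoogleLogging(argv[0]);
  HandleCancelSketch(false);  // pass true to force the deliberate crash
}
{code}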

Some threads doing Cancel() are stuck in QueryState::WaitForPrepare():
{noformat}
0 libpthread-2.17.so + 0xb945
1 impalad!impala::Promise::Get() [promise.h : 87 
+ 0x13]
2 impalad!impala::TypedCountingBarrier::Wait() [counting-barrier.h : 59 + 
0xc]
3 impalad!impala::CountingBarrier::Wait() [counting-barrier.h : 95 + 0xc]
4 impalad!impala::QueryState::WaitForPrepare() [query-state.cc : 478 + 0x1a]
5 impalad!impala::QueryState::Cancel() [query-state.cc : 659 + 0x16]
6 
impalad!impala::ControlService::CancelQueryFInstances(impala::CancelQueryFInstancesRequestPB
 const*, impala::CancelQueryFInstancesResponsePB*, kudu::rpc::RpcContext*) 
[control-service.cc : 183 + 0x17]
7 
impalad!impala::ControlServiceIf::ControlServiceIf(scoped_refptr
 const&, scoped_refptr 
const&)::{lambda(google::protobuf::Message const*, google::protobuf::Message*, 
kudu::rpc::RpcContext*)#4}::operator()(google::protobuf::Message const*, 
google::protobuf::Message*, kudu::rpc::RpcContext*) const + 0x3e
8 impalad!std::_Function_handler&, const 
scoped_refptr&):: >::_M_invoke [functional : 2039 + 0x20]
...{noformat}
Those threads are stuck there because a bunch of threads are blocked in 
ThreadResourceMgr::CreatePool() (there are 25 such threads here):

 
{noformat}
0 libpthread-2.17.so + 0xe42d
1 impalad!boost::unique_lock::lock() [lock_types.hpp : 346 + 0xf]
2 impalad!boost::unique_lock::unique_lock(boost::mutex&) 
[lock_types.hpp : 124 + 0xc]
3 impalad!impala::ThreadResourceMgr::CreatePool() [thread-resource-mgr.cc : 52 
+ 0x17]
4 impalad!impala::RuntimeState::Init() [runtime-state.cc : 117 + 0x22]
5 impalad!impala::RuntimeState::RuntimeState(impala::QueryState*, 
impala::TPlanFragmentCtx const&, impala::TPlanFragmentInstanceCtx const&, 
impala::ExecEnv*) [runtime-state.cc : 79 + 0xc]
6 impalad!impala::FragmentInstanceState::Prepare() [fragment-instance-state.cc 
: 137 + 0x4b]
7 impalad!impala::FragmentInstanceState::Exec() [fragment-instance-state.cc : 
73 + 0x19]
8 impalad!impala::QueryState::ExecFInstance(impala::FragmentInstanceState*) 
[query-state.cc : 639 + 0x19]
9 impalad!impala::QueryState::StartFInstances()::{lambda()#1}::operator()() 
const + 0x26
...{noformat}
Those threads are waiting on a lock held by a thread in 
ThreadResourceMgr::DestroyPool():
{noformat}
0 libpthread-2.17.so + 0xe42d
1 impalad!boost::unique_lock::lock() [lock_types.hpp : 346 + 0xf]
2 impalad!boost::unique_lock::unique_lock(boost::mutex&) 
[lock_types.hpp : 124 + 0xc]
3 
impalad!impala::HdfsScanNode::ThreadTokenAvailableCb(impala::ThreadResourcePool*)
 [hdfs-scan-node.cc : 296 + 0x20]
4 impalad!boost::_mfi::mf1::operator()(impala::HdfsScanNode*, 
impala::ThreadResourcePool*) const [mem_fn_template.hpp : 165 + 0x1e]
5 impalad!void boost::_bi::list2, 
boost::arg<1> >::operator(), boost::_bi::list1 
>(boost::_bi::type, boost::_mfi::mf1&, 
boost::_bi::list1&, int) [bind.hpp : 313 + 0x4e]
6 impalad!void boost::_bi::bind_t, 
boost::_bi::list2, boost::arg<1> > 
>::operator()(impala::ThreadResourcePool*&) 
[bind_template.hpp : 32 + 0x22]
7 
impalad!boost::detail::function::void_function_obj_invoker1, 
boost::_bi::list2, boost::arg<1> > >, 
void, 
impala::ThreadResourcePool*>::invoke(boost::detail::function::function_buffer&, 
impala::ThreadResourcePool*) [function_template.hpp : 153 + 0x13]
8 impalad!boost::function1::operator()(impala::ThreadResourcePool*) const 
[function_template.hpp : 767 + 0x18]
9 impalad!impala::ThreadResourcePool::InvokeCallbacks() [thread-resource-mgr.cc 
: 105 + 0x19]
10 
impalad!impala::ThreadResourceMgr::UpdatePoolQuotas(impala::ThreadResourcePool*)
 [thread-resource-mgr.cc : 120 + 0xc]
11 
impalad!impala::ThreadResourceMgr::DestroyPool(std::unique_ptr >) [thread-resource-mgr.cc : 
71 + 0x14]
12 impalad!impala::RuntimeState::ReleaseResources() [runtime-state.cc : 281 + 
0x46]
13 impalad!impala::FragmentInstanceState::Close() [fragment-instance-state.cc : 
406 + 0x13]
14 impalad!impala::FragmentInstanceState::Exec() [fragment-instance-state.cc : 
99 + 0xf]
15 impalad!impala::QueryState::ExecFInstance(impala::FragmentInstanceState*) 
[query-state.cc : 639 + 0x19]
16 impalad!impala::QueryState::StartFInstances()::{lambda()#1}::operator()() 
const + 0x26
...{noformat}
I'm still exploring below that level, but there is a tangle of locks and 
threads waiting.

 

 

 


[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-03-25 Thread Joe McDonnell (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801266#comment-16801266 ]

Joe McDonnell commented on IMPALA-8322:
---

Fun fact: test_last_ddl_time_update.py (which is a test that has hit this 
issue) is disabled for core tests, except on s3:

https://github.com/apache/impala/blob/master/tests/metadata/test_last_ddl_time_update.py#L39-L42
{code:java}
if cls.exploration_strategy() == 'core' and not IS_S3:
  # Don't run on core. This test is very slow and we are unlikely
  # to regress here.
  cls.ImpalaTestMatrix.add_constraint(lambda v: False)
{code}


[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-03-25 Thread Joe McDonnell (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800987#comment-16800987 ]

Joe McDonnell commented on IMPALA-8322:
---

Tracing from s3 run with [https://gerrit.cloudera.org/#/c/12832/] : 
{noformat}
I0325 11:50:19.594966 31829 krpc-data-stream-mgr.cc:362] Sender 127.0.0.1 timed 
out waiting for receiver fragment instance: 474a38fc210026e0:d8d388d5, 
dest node: 2
I0325 11:50:19.595325 31829 rpcz_store.cc:265] Call 
impala.DataStreamService.TransmitData from 127.0.0.1:52770 (request call id 
9447) took 125425ms. Request Metrics: {"find_receiver_us":4}
I0325 11:50:19.595376 31829 krpc-data-stream-mgr.cc:362] Sender 127.0.0.1 timed 
out waiting for receiver fragment instance: 474a38fc210026e0:d8d388d5, 
dest node: 2
I0325 11:50:19.595402 31829 rpcz_store.cc:265] Call 
impala.DataStreamService.TransmitData from 127.0.0.1:52784 (request call id 
25823) took 125424ms. Request Metrics: {"find_receiver_us":10}
I0325 11:50:19.595419 31829 krpc-data-stream-mgr.cc:362] Sender 127.0.0.1 timed 
out waiting for receiver fragment instance: 054bc34a530449e3:66124dd8, 
dest node: 2
I0325 11:50:19.595434 31829 rpcz_store.cc:265] Call 
impala.DataStreamService.TransmitData from 127.0.0.1:52770 (request call id 
9446) took 125425ms. Request Metrics: {"find_receiver_us":14}
I0325 11:50:19.595453 31829 krpc-data-stream-mgr.cc:362] Sender 127.0.0.1 timed 
out waiting for receiver fragment instance: 054bc34a530449e3:66124dd8, 
dest node: 2
I0325 11:50:19.595468 31829 rpcz_store.cc:265] Call 
impala.DataStreamService.TransmitData from 127.0.0.1:52784 (request call id 
25822) took 125425ms. Request Metrics: {"find_receiver_us":7}
...
W0325 11:50:37.711663 31807 rpcz_store.cc:251] Call 
impala.ControlService.CancelQueryFInstances from 127.0.0.1:53656 (request call 
id 183) took 18114ms (client timeout 1).
...
W0325 11:50:37.712100 31807 rpcz_store.cc:255] Trace:
0325 11:50:19.596826 (+ 0us) impala-service-pool.cc:165] Inserting onto call 
queue
0325 11:50:19.596841 (+ 15us) impala-service-pool.cc:245] Handling call
0325 11:50:19.596879 (+ 38us) query-exec-mgr.cc:99] Found query 
054bc34a530449e3:66124dd8
0325 11:50:37.711617 (+18114738us) query-state.cc:657] Cancelling 
FragmentInstanceStates...
0325 11:50:37.711623 (+ 6us) krpc-data-stream-mgr.cc:328] Cancelling stream 
KrpcDataStreamMgr 054bc34a530449e3:66124dd8
0325 11:50:37.711640 (+ 17us) krpc-data-stream-recvr.cc:591] Cancelling stream 
fragment_instance_id=054bc34a530449e3:66124dd8 node_id=2
0325 11:50:37.711652 (+ 12us) inbound_call.cc:162] Queueing success response
Metrics: {"KrpcDataStreamMgr::lock_us":0,"WaitForPrepare_us":18114730}{noformat}
Tests are still running, but I'm working on getting the logs.


[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-03-25 Thread Joe McDonnell (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800886#comment-16800886 ]

Joe McDonnell commented on IMPALA-8322:
---

Ran with [~tlipcon]'s change a couple times, but the issue did not reproduce. I 
will keep trying.

As part of the work for IMPALA-8344, I ran the s3 tests with S3Guard, and this 
issue reproduced there.




[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-03-21 Thread Todd Lipcon (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798551#comment-16798551 ]

Todd Lipcon commented on IMPALA-8322:
-

I'll put up a review in a minute with more TRACE() calls in these code paths; 
hopefully that will produce a better smoking gun from the tests if they keep 
failing.




[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-03-21 Thread Joe McDonnell (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798442#comment-16798442 ]

Joe McDonnell commented on IMPALA-8322:
---

Ran a couple of times with datastream_sender_timeout_ms set to 5 minutes, and 
it didn't reproduce in those runs.




[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

2019-03-20 Thread Joe McDonnell (JIRA)


[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797515#comment-16797515 ]

Joe McDonnell commented on IMPALA-8322:
---

Added run_tests_swimlane.json.gz, a swimlane view of pytest execution during 
the failure scenario. (This can be viewed in Chrome by going to about://tracing 
and loading the file.) For this run, the following tests were running 
concurrently when the failure occurred:
 # 
query_test/test_decimal_fuzz.py::TestDecimalFuzz::()::test_decimal_ops[exec_option:
 \{'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 5000, 
'disable_codegen': False, 'abort_on_error': 1, 
'exec_single_node_rows_threshold': 0}]
 # 
query_test/test_cancellation.py::TestCancellationParallel::()::test_cancel_select[protocol:
 beeswax | table_format: avro/snap/block | exec_option: \{'batch_size': 0, 
'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | query_type: SELECT 
| wait_action: 0:GETNEXT:WAIT | cancel_delay: 0.01 | cpu_limit_s: 10 | 
query: select * from lineitem order by l_orderkey | fail_rpc_action: 
COORD_CANCEL_QUERY_FINSTANCES_RPC:FAIL | join_before_close: True | 
buffer_pool_limit: 0]
 # query_test/test_kudu.py::TestDropDb::()::test_drop_non_empty_db
 # query_test/test_exprs.py::TestExprs::()::test_exprs[protocol: beeswax | 
exec_option: \{'batch_size': 0, 'num_nodes': 0, 
'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: 
text/none | enable_expr_rewrites: 0]
 # 
query_test/test_chars.py::TestStringQueries::()::test_chars_tmp_tables[protocol:
 beeswax | exec_option: \{'batch_size': 0, 'num_nodes': 0, 
'disable_codegen_rows_threshold': 0, 'disable_codegen': True, 'abort_on_error': 
1, 'exec_single_node_rows_threshold': 0} | table_format: text/none]
 # 
query_test/test_aggregation.py::TestDistinctAggregation::()::test_multiple_distinct[protocol:
 beeswax | exec_option: \{'disable_codegen': True, 'shuffle_distinct_exprs': 
False} | table_format: text/none]
 # query_test/test_mt_dop.py::TestMtDop::()::test_compute_stats[mt_dop: 8 | 
protocol: beeswax | exec_option: \{'batch_size': 0, 'num_nodes': 0, 
'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: 
parquet/none]
 # 
metadata/test_last_ddl_time_update.py::TestLastDdlTimeUpdate::()::test_kudu[protocol:
 beeswax | exec_option: \{'batch_size': 0, 'num_nodes': 0, 
'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: 
text/none]

Of these, #3 and #8 hit the error in this run. Some of the others (like #6, #7) 
have hit issues in previous runs (and some others that have hit issues are not 
listed here).
