[ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817454#comment-16817454 ]

ASF subversion and git services commented on IMPALA-8322:
---------------------------------------------------------

Commit 8ec17b7cdffbd82ce7b3e652edc2530df083eeab in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8ec17b7 ]

IMPALA-8322: Add periodic dirty check of done_ in ThreadTokenAvailableCb

When HdfsScanNode is cancelled or hits an error, SetDoneInternal() holds
HdfsScanNode::lock_ while it runs RequestContext::Cancel(), which can
wait on IO threads to complete outstanding IOs. This can cause a cascade
of blocked threads that makes Prepare() take a significant amount of
time, leading to datastream sender timeouts.

The specific scenario observed involves the following threads:
Thread 1: A DiskIoMgr thread is blocked on IO in hdfsOpenFile() or
  hdfsRead(), holding HdfsFileReader::lock_.
Thread 2: An HDFS scanner thread is blocked in
  HdfsScanNode::SetDoneInternal() -> RequestContext::Cancel()
  -> ScanRange::CancelInternal(), waiting on HdfsFileReader::lock_.
  It is holding HdfsScanNode::lock_.
Thread 3: A thread in ThreadResourceMgr::DestroyPool() -> (a few layers)
  -> HdfsScanNode::ThreadTokenAvailableCb() is blocked waiting on
  HdfsScanNode::lock_ while holding ThreadResourceMgr::lock_.
Thread 4: A thread in FragmentInstanceState::Prepare()
  -> RuntimeState::Init() -> ThreadResourceMgr::CreatePool() is blocked
  waiting on ThreadResourceMgr::lock_.

When Prepare() takes a significant amount of time, datastream senders
time out waiting for the datastream receivers to start up, and the
affected queries fail. S3 has higher IO latencies and does not have file
handle caching, so it is more susceptible to this issue than other
platforms.

This changes HdfsScanNode::ThreadTokenAvailableCb() to periodically do a
dirty check of HdfsScanNode::done_ when waiting to acquire the lock. This
avoids the blocking experienced by Thread 3 in the example above.
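
As a rough illustration of the idea, here is a minimal sketch of the
dirty-check pattern. It is not the actual Impala code: the class, the
lock type, and the 10ms retry interval are simplified stand-ins.

#include <atomic>
#include <chrono>
#include <mutex>

// Minimal sketch of the dirty-check pattern (illustrative only; the names
// and types are simplified stand-ins, not the actual Impala classes).
class ScanNodeSketch {
 public:
  // Called when a thread token becomes available. Instead of blocking
  // unconditionally on lock_, retry with a timeout and give up as soon as
  // an unlocked ("dirty") read of done_ shows the node is finished.
  void ThreadTokenAvailableCb() {
    std::unique_lock<std::timed_mutex> l(lock_, std::defer_lock);
    while (!l.try_lock_for(std::chrono::milliseconds(10))) {
      if (done_.load(std::memory_order_acquire)) return;  // dirty check
    }
    if (done_.load(std::memory_order_relaxed)) return;
    // ... decide whether to start another scanner thread while holding lock_ ...
  }

  // Models SetDoneInternal(): done_ is set before the potentially slow
  // cancellation work, so waiters can observe it while lock_ stays held.
  void SetDone() {
    std::lock_guard<std::timed_mutex> l(lock_);
    done_.store(true, std::memory_order_release);
    // ... RequestContext::Cancel()-style work that may block on IO threads ...
  }

 private:
  std::timed_mutex lock_;
  std::atomic<bool> done_{false};
};

The important property is that done_ becomes visible before the
long-running cancellation work, so a caller in the position of Thread 3
above can stop waiting for lock_ instead of stalling behind
RequestContext::Cancel().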

Testing:
 - Ran tests on normal HDFS and repeatedly on S3

Change-Id: I4881a3e5bfda64e8d60af95ad13b450cf7f8c130
Reviewed-on: http://gerrit.cloudera.org:8080/12968
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> S3 tests encounter "timed out waiting for receiver fragment instance"
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-8322
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8322
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.3.0
>            Reporter: Joe McDonnell
>            Priority: Blocker
>              Labels: broken-build
>         Attachments: fb5b9729-2d7a-4590-ea365b87-d2ead75e.dmp_dumped, 
> run_tests_swimlane.json.gz
>
>
> This has been seen multiple times when running S3 tests:
> {noformat}
> query_test/test_join_queries.py:57: in test_basic_joins
>     self.run_test_case('QueryTest/joins', new_vector)
> common/impala_test_suite.py:472: in run_test_case
>     result = self.__execute_query(target_impalad_client, query, user=user)
> common/impala_test_suite.py:699: in __execute_query
>     return impalad_client.execute(query, user=user)
> common/impala_connection.py:174: in execute
>     return self.__beeswax_client.execute(sql_stmt, user=user)
> beeswax/impala_beeswax.py:183: in execute
>     handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:360: in __execute_query
>     self.wait_for_finished(handle)
> beeswax/impala_beeswax.py:381: in wait_for_finished
>     raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> E    Query aborted:Sender 127.0.0.1 timed out waiting for receiver fragment instance: 6c40d992bb87af2f:0ce96e5d00000007, dest node: 4{noformat}
> This is related to IMPALA-6818. On a bad run, there are various timeouts in the impalad logs:
> {noformat}
> I0316 10:47:16.359313 20175 krpc-data-stream-mgr.cc:354] Sender 127.0.0.1 timed out waiting for receiver fragment instance: ef4a5dc32a6565bd:a8720b8500000007, dest node: 5
> I0316 10:47:16.359345 20175 rpcz_store.cc:265] Call impala.DataStreamService.TransmitData from 127.0.0.1:40030 (request call id 14881) took 120182ms. Request Metrics: {}
> I0316 10:47:16.359380 20175 krpc-data-stream-mgr.cc:354] Sender 127.0.0.1 timed out waiting for receiver fragment instance: d148d83e11a4603d:54dc35f700000004, dest node: 3
> I0316 10:47:16.359395 20175 rpcz_store.cc:265] Call impala.DataStreamService.TransmitData from 127.0.0.1:40030 (request call id 14880) took 123097ms. Request Metrics: {}
> ... various messages ...
> I0316 10:47:56.364990 20154 kudu-util.h:108] Cancel() RPC failed: Timed out: CancelQueryFInstances RPC to 127.0.0.1:27000 timed out after 10.000s (SENT)
> ... various messages ...
> W0316 10:48:15.056421 20150 rpcz_store.cc:251] Call impala.ControlService.CancelQueryFInstances from 127.0.0.1:40912 (request call id 202) took 48695ms (client timeout 10000).
> W0316 10:48:15.056473 20150 rpcz_store.cc:255] Trace:
> 0316 10:47:26.361265 (+ 0us) impala-service-pool.cc:165] Inserting onto call queue
> 0316 10:47:26.361285 (+ 20us) impala-service-pool.cc:245] Handling call
> 0316 10:48:15.056398 (+48695113us) inbound_call.cc:162] Queueing success response
> Metrics: {}
> I0316 10:48:15.057087 20139 connection.cc:584] Got response to call id 202 after client already timed out or cancelled{noformat}
> So far, this has only happened on S3. The system load at the time is not higher than normal; if anything, it is lower than normal.


