Joe McDonnell has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12968 )


Change subject: IMPALA-8322: Add periodic dirty check of done_ in 
ThreadTokenAvailableCb
......................................................................

IMPALA-8322: Add periodic dirty check of done_ in ThreadTokenAvailableCb

When HdfsScanNode is cancelled or hits an error, SetDoneInternal() holds
HdfsScanNode::lock_ while it runs RequestContext::Cancel(), which can
wait on IO threads to complete outstanding IOs. This can cause a cascade
of blocked threads that makes Prepare() take a significant time and
causes datastream sender timeouts.

The specific scenario seen has this set of threads:
Thread 1: A DiskIoMgr thread is blocked on IO in hdfsOpenFile() or
  hdfsRead(), holding HdfsFileReader::lock_.
Thread 2: An HDFS scanner thread is blocked in
  HdfsScanNode::SetDoneInternal() -> RequestContext::Cancel()
  -> ScanRange::CancelInternal(), waiting on HdfsFileReader::lock_.
  It is holding HdfsScanNode::lock_.
Thread 3: A thread in ThreadResourceMgr::DestroyPool() -> (a few layers)
  -> HdfsScanNode::ThreadTokenAvailableCb() is blocked waiting on
  HdfsScanNode::lock_ while holding ThreadResourceMgr::lock_.
Thread 4: A thread in FragmentInstanceState::Prepare()
  -> RuntimeState::Init() -> ThreadResourceMgr::CreatePool() is blocked
  waiting on ThreadResourceMgr::lock_.

When Prepare() takes a significant time, datastream senders will time out
waiting for the datastream receivers to start up. This causes failed
queries. S3 has higher latencies for IO and does not have file handle
caching, so S3 is more susceptible to this issue than other platforms.

This changes HdfsScanNode::ThreadTokenAvailableCb() to periodically do a
dirty check of HdfsScanNode::done_ when waiting to acquire the lock. This
avoids the blocking experienced by Thread 3 in the example above.
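
For illustration only, the approach described above can be sketched roughly as
follows. This is not the code in this patch: ScanNodeSketch, kLockPollInterval,
threads_started() and CallbackEscapesHeldLock() are hypothetical names, and the
use of std::timed_mutex with a 10 ms poll interval is an assumption made for
the sketch. The idea is that the callback tries to take the lock with a
timeout and, each time the attempt times out, reads done_ without the lock
(the "dirty check") so that it can return instead of blocking.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Assumed poll interval; the value used by the real change may differ.
constexpr std::chrono::milliseconds kLockPollInterval(10);

// Hypothetical stand-in for a scan node; not Impala's actual class.
class ScanNodeSketch {
 public:
  // Returns true if a scanner thread was "started" under lock_, false if the
  // node was already done (seen either by the dirty check or under the lock).
  bool ThreadTokenAvailableCb() {
    std::unique_lock<std::timed_mutex> lock(lock_, std::defer_lock);
    while (!lock.try_lock_for(kLockPollInterval)) {
      // Dirty check: read done_ without holding lock_. If the node is done,
      // there is nothing to start, so return rather than keep waiting.
      if (done_.load(std::memory_order_acquire)) return false;
    }
    // lock_ is held here; re-check done_ before doing any work.
    if (done_.load(std::memory_order_acquire)) return false;
    ++threads_started_;
    return true;
  }

  void SetDone() { done_.store(true, std::memory_order_release); }
  int threads_started() const { return threads_started_.load(); }
  std::timed_mutex& lock() { return lock_; }

 private:
  std::timed_mutex lock_;
  std::atomic<bool> done_{false};
  std::atomic<int> threads_started_{0};
};

// Mimics Thread 2 and Thread 3 from the scenario: this thread holds lock_
// with done_ set, while another thread runs the callback. Returns true if
// the callback gave up without ever acquiring the lock.
bool CallbackEscapesHeldLock() {
  ScanNodeSketch node;
  node.SetDone();
  std::unique_lock<std::timed_mutex> held(node.lock());
  bool started = true;
  std::thread cb([&] { started = node.ThreadTokenAvailableCb(); });
  cb.join();  // Completes even though lock_ is still held by this thread.
  assert(held.owns_lock());
  return !started;
}
```

In this sketch the caller of ThreadTokenAvailableCb() waits at most about one
poll interval after done_ is set, which is what lets it release any outer lock
it is holding instead of waiting for the cancellation to finish.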

Testing:
 - Ran tests on normal HDFS and repeatedly on S3

Change-Id: I4881a3e5bfda64e8d60af95ad13b450cf7f8c130
---
M be/src/common/names.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
3 files changed, 28 insertions(+), 16 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12968/1
--
To view, visit http://gerrit.cloudera.org:8080/12968
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I4881a3e5bfda64e8d60af95ad13b450cf7f8c130
Gerrit-Change-Number: 12968
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com>
