Joe McDonnell has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12968 )

Change subject: IMPALA-8322: Add periodic dirty check of done_ in ThreadTokenAvailableCb
......................................................................

IMPALA-8322: Add periodic dirty check of done_ in ThreadTokenAvailableCb

When HdfsScanNode is cancelled or hits an error, SetDoneInternal()
holds HdfsScanNode::lock_ while it runs RequestContext::Cancel(),
which can wait on IO threads to complete outstanding IOs. This can
set off a cascade of blocked threads that makes Prepare() take a
long time and causes datastream sender timeouts.

The specific scenario seen has this set of threads:

Thread 1: A DiskIoMgr thread is blocked on IO in hdfsOpenFile() or
  hdfsRead(), holding HdfsFileReader::lock_.
Thread 2: An HDFS scanner thread is blocked in
  HdfsScanNode::SetDoneInternal() -> RequestContext::Cancel()
  -> ScanRange::CancelInternal(), waiting on HdfsFileReader::lock_.
  It is holding HdfsScanNode::lock_.
Thread 3: A thread in ThreadResourceMgr::DestroyPool() -> (a few layers)
  -> HdfsScanNode::ThreadTokenAvailableCb() is blocked waiting on
  HdfsScanNode::lock_ while holding ThreadResourceMgr::lock_.
Thread 4: A thread in FragmentInstanceState::Prepare()
  -> RuntimeState::Init() -> ThreadResourceMgr::CreatePool() is blocked
  waiting on ThreadResourceMgr::lock_.

When Prepare() takes a long time, datastream senders time out
waiting for the datastream receivers to start up, and the queries
fail. S3 has higher IO latencies and does not have file handle
caching, so S3 is more susceptible to this issue than other platforms.

This changes HdfsScanNode::ThreadTokenAvailableCb() to periodically
do a dirty check of HdfsScanNode::done_ while waiting to acquire
HdfsScanNode::lock_. If done_ is set, the callback returns without
taking the lock, which avoids the blocking experienced by Thread 3
in the example above.

Testing:
- Ran tests on normal HDFS and repeatedly on S3

Change-Id: I4881a3e5bfda64e8d60af95ad13b450cf7f8c130
---
M be/src/common/names.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
3 files changed, 28 insertions(+), 16 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/12968/1
--
To view, visit http://gerrit.cloudera.org:8080/12968
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I4881a3e5bfda64e8d60af95ad13b450cf7f8c130
Gerrit-Change-Number: 12968
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com>