Philip Zeyliger has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12097 )


Change subject: IMPALA-7980: Fix spinning threads because of buggy handling of num_unqueued_files_.
......................................................................

IMPALA-7980: Fix spinning threads because of buggy handling of num_unqueued_files_.

***
WIP:
  I'm still working on the DCHECK() for num_unqueued_files_.
  It can't be in SetDone(), because others may call Close()
  on the scan node. Trying where it is now, but tests haven't run...
***

When running TPC-DS query 1 on scale factor 10,000 (10TB) on a 140-node cluster
with replica_preference=remote, we observed really high system CPU usage for
some of the scan nodes:

  HDFS_SCAN_NODE (id=6):(Total: 59s107ms, non-child: 59s107ms, % non-child: 100.00%)
    - BytesRead: 80.50 MB (84408563)
    - ScannerThreadsSysTime: 36m17s

Using 36 minutes of system time in only 1 minute of wall-clock time required
~30 threads to be spinning in the kernel. Using perf, we found heavy use of
futex_wait() and pthread_cond_wait(). Eventually, we figured out that
ScannerThreads, once started, loop forever looking for work. The case where
there is no work is supposed to be rare, and the scanner threads are supposed
to exit once num_unqueued_files_ reaches 0, but in some cases that counter
isn't appropriately decremented.

The reproduction is any query that uses runtime filters to filter out
entire files. Something like:

  set RUNTIME_FILTER_WAIT_TIME_MS=10000;
  select count(*)
  from customer
  join customer_address on c_current_addr_sk = ca_address_sk
  where ca_street_name="DoesNotExist" and c_last_name="DoesNotExist";

triggers this behavior.

Interestingly, though this wastes cycles, query results are unaffected.

The point fix is to decrement the counter when skipping files.
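
As a rough illustration only (this is not the code in hdfs-scan-node-base.cc),
the counter behavior can be modeled with a toy stand-in. Everything here is
hypothetical: ScanNodeModel, IssueFile(), the apply_fix flag, and
RemainingAfterFiltering() are invented for the sketch, and the real scan node
has far more state and locking. The sketch only shows why a file skipped by a
runtime filter must still decrement the counter that scanner threads wait on.

```cpp
#include <atomic>
#include <vector>

// Hypothetical, heavily simplified model of the scan node's bookkeeping.
struct ScanNodeModel {
  // Files whose scan ranges have not been queued yet. In this model,
  // scanner threads may only exit once this reaches 0.
  std::atomic<int> num_unqueued_files{0};
  int ranges_queued = 0;

  // Called once per file. passes_filter models the runtime-filter check;
  // apply_fix toggles between the buggy and the fixed behavior.
  void IssueFile(bool passes_filter, bool apply_fix) {
    if (!passes_filter) {
      // The whole file is skipped, so nothing is queued for it.
      // Buggy behavior: return without touching the counter.
      // Fixed behavior: the file is accounted for, so decrement.
      if (apply_fix) num_unqueued_files.fetch_sub(1);
      return;
    }
    ++ranges_queued;
    num_unqueued_files.fetch_sub(1);
  }
};

// Returns the counter value left over after every file has been considered.
// A non-zero result means scanner threads would keep looking for work.
int RemainingAfterFiltering(const std::vector<bool>& passes, bool apply_fix) {
  ScanNodeModel node;
  node.num_unqueued_files.store(static_cast<int>(passes.size()));
  for (bool p : passes) node.IssueFile(p, apply_fix);
  return node.num_unqueued_files.load();
}
```

In this model, with three files of which two are filtered out, the buggy path
leaves the counter at 2 and the fixed path leaves it at 0.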

This bug was co-debugged by Todd Lipcon, Joe McDonnell, Philip Zeyliger,
and Michael Ho.

Change-Id: I133de13238d3d05c510e2ff771d48979125735b1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node.cc
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters_wait.test
M tests/query_test/test_runtime_filters.py
4 files changed, 18 insertions(+), 2 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/12097/1
--
To view, visit http://gerrit.cloudera.org:8080/12097
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I133de13238d3d05c510e2ff771d48979125735b1
Gerrit-Change-Number: 12097
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <phi...@cloudera.com>
