Philip Zeyliger has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12097 )

Change subject: IMPALA-7980: Fix spinning threads because of buggy handling of num_unqueued_files_.
......................................................................

IMPALA-7980: Fix spinning threads because of buggy handling of num_unqueued_files_.

*** WIP: I'm still working on the DCHECK() for num_unqueued_files_.
It can't be in SetDone(), because others may call Close() on the
scan node. Trying where it is now, but tests haven't run... ***

When running TPC-DS query 1 on scale factor 10,000 (10TB) on a
140-node cluster with replica_preference=remote, we observed really
high system CPU usage for some of the scan nodes:

  HDFS_SCAN_NODE (id=6):(Total: 59s107ms, non-child: 59s107ms, % non-child: 100.00%)
    - BytesRead: 80.50 MB (84408563)
    - ScannerThreadsSysTime: 36m17s

Using 36 minutes of system time in only 1 minute of wall-clock time
required ~30 threads to be spinning in the kernel. We were able to use
perf to find a lot of usage of futex_wait() and pthread_cond_wait().
Eventually, we figured out that ScannerThreads, once started, loop
forever looking for work. The case where there is no work is supposed
to be rare, and the scanner threads are supposed to exit once
num_unqueued_files_ reaches 0, but, in some cases, that counter isn't
appropriately decremented.

The reproduction is any query that uses runtime filters to filter out
entire files. Something like:

  set RUNTIME_FILTER_WAIT_TIME_MS=10000;
  select count(*)
  from customer
  join customer_address on c_current_addr_sk = ca_address_sk
  where ca_street_name="DoesNotExist" and c_last_name="DoesNotExist";

triggers this behavior. Interestingly, though this wastes cycles,
query results are unaffected.

The point fix is to decrement the counter when skipping files.

This bug was co-debugged by Todd Lipcon, Joe McDonnell, Philip
Zeyliger, and Michael Ho.

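To make the counter bug concrete, here is a minimal single-threaded sketch in C++. It is not the Impala code: ScanNodeModel, IssueFiles(), and ScannerThreadsCanExit() are hypothetical names, and the model assumes that the counter starts at the number of files and that scanner threads may exit only once it reaches 0. It only illustrates why skipping a file without decrementing the counter leaves threads looking for work forever.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a scan node's bookkeeping; not Impala's real class.
struct ScanNodeModel {
  // Files whose scan ranges have not been issued yet (models num_unqueued_files_).
  std::atomic<int> num_unqueued_files;
  // Number of files that were actually queued for scanning.
  int num_queued_files = 0;

  explicit ScanNodeModel(int num_files) : num_unqueued_files(num_files) {}

  // Walks all files once. passes_filter[i] == false means the runtime
  // filter rejects the whole file, so it is skipped rather than queued.
  // decrement_on_skip == false models the buggy behavior,
  // decrement_on_skip == true models the point fix.
  void IssueFiles(const std::vector<bool>& passes_filter, bool decrement_on_skip) {
    for (bool passes : passes_filter) {
      if (!passes) {
        if (decrement_on_skip) num_unqueued_files.fetch_sub(1);
        continue;  // Skipped file: nothing is queued for it.
      }
      ++num_queued_files;
      num_unqueued_files.fetch_sub(1);
    }
    // Analogous in spirit to a DCHECK: the counter must never go negative.
    assert(num_unqueued_files.load() >= 0);
  }

  // A scanner thread that finds no work may exit only if no files are
  // still waiting to be issued.
  bool NoMoreFilesExpected() const { return num_unqueued_files.load() == 0; }
};

// Returns true if scanner threads would be allowed to exit after all
// files have been considered.
bool ScannerThreadsCanExit(const std::vector<bool>& passes_filter,
                           bool decrement_on_skip) {
  ScanNodeModel node(static_cast<int>(passes_filter.size()));
  node.IssueFiles(passes_filter, decrement_on_skip);
  return node.NoMoreFilesExpected();
}
```

With every file filtered out and decrement_on_skip set to false, the counter stays at the file count and the exit condition never holds, which corresponds to the spinning described above. With decrement_on_skip set to true, it reaches 0.
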
Change-Id: I133de13238d3d05c510e2ff771d48979125735b1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node.cc
M testdata/workloads/functional-query/queries/QueryTest/runtime_filters_wait.test
M tests/query_test/test_runtime_filters.py
4 files changed, 18 insertions(+), 2 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/12097/1
--
To view, visit http://gerrit.cloudera.org:8080/12097
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I133de13238d3d05c510e2ff771d48979125735b1
Gerrit-Change-Number: 12097
Gerrit-PatchSet: 1
Gerrit-Owner: Philip Zeyliger <phi...@cloudera.com>