Hello Michael Ho, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12097

to look at the new patch set (#8).

Change subject: IMPALA-7980: Fix spinning because of buggy num_unqueued_files_.
......................................................................

IMPALA-7980: Fix spinning because of buggy num_unqueued_files_.

This commit removes num_unqueued_files_ and replaces it with a more tightly
scoped and easier-to-reason-about remaining_scan_range_submissions_ variable.
This variable (like its predecessor) signals to scanner threads that they may
exit (instead of spinning), because no more scan ranges will ever be added for
them to process.

In practice, most scanner implementations never call AddDiskIoRanges() after
IssueInitialRanges(). The exception is sequence files and Avro, which share a
common base class. Instead of incrementing and decrementing this counter in a
variety of paths, this commit makes the common case simple (set to 1 initially;
decrement at the exit points of IssueInitialRanges()), while the complicated,
sequence-file case is handled within base-sequence-scanner.cc.

Note that this is not the first instance of a subtle bug in this code. The
following two JIRAs (and corresponding commits) are fundamentally similar bugs:

  IMPALA-3798: Disable per-split filtering for sequence-based scanners
  IMPALA-1730: reduce scanner thread spinning windows

We ran into this bug when running TPC-DS query 1 at scale factor 10,000 (10TB)
on a 140-node cluster with replica_preference=remote. We observed very high
system CPU usage for some of the scan nodes:

  HDFS_SCAN_NODE (id=6):(Total: 59s107ms, non-child: 59s107ms, % non-child: 100.00%)
     - BytesRead: 80.50 MB (84408563)
     - ScannerThreadsSysTime: 36m17s

Using 36 minutes of system time in only 1 minute of wall-clock time required
~30 threads to be spinning in the kernel. Using perf, we found a lot of time
spent in futex_wait() and pthread_cond_wait(). Eventually, we figured out that
scanner threads, once started, loop forever looking for work. The case where
there is no work is supposed to be rare, and the scanner threads are supposed
to exit once num_unqueued_files_ reaches 0, but in some cases that counter
isn't appropriately decremented.

The reproduction is any query that uses runtime filters to filter out entire
files. Something like:

  set RUNTIME_FILTER_WAIT_TIME_MS=10000;
  select count(*) from customer join customer_address
    on c_current_addr_sk = ca_address_sk
  where ca_street_name="DoesNotExist" and c_last_name="DoesNotExist";

triggers this behavior. This code path is covered by several existing tests,
most directly by test_runtime_filters.py:test_file_filtering(). Interestingly,
though this bug wastes cycles, query results are unaffected.

I initially fixed this bug with a point fix that handled the case where runtime
filters caused files to be skipped, and added assertions checking that
num_unqueued_files_ was decremented to zero when queries finished. Doing this
led me, somewhat slowly, both to finding similar bugs in other parts of the
code (HdfsTextScanner::IssueInitialRanges() had the same bug if the entire file
was skipped) and to fighting races on the assertion itself. I eventually
concluded that there's really no shared synchronization between
progress_.Done() and num_unqueued_files_. The same is true of the current
implementation, so it does not include such assertions.
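For reviewers skimming the mechanism, here is a minimal C++ sketch of the
counter pattern described above. It is not the patch itself:
remaining_scan_range_submissions_ mirrors the new member, but the class name
and the GetNextQueuedRange()/WaitForWork()/ProcessRange() helpers are
simplified, hypothetical placeholders for the real queueing logic.

  // Sketch only: illustrates how a single "remaining submissions" counter
  // lets idle scanner threads exit instead of spinning.
  #include <atomic>

  struct ScanRange;  // placeholder for the real scan-range type

  class ScanNodeSketch {
   public:
    void IssueInitialRanges() {
      // ... issue the initial scan ranges; runtime filters may skip files ...
      // Decrement exactly once on every exit path, including the path where
      // all files were filtered out and no ranges were submitted at all.
      remaining_scan_range_submissions_.fetch_sub(1);
    }

    void ScannerThread() {
      while (true) {
        ScanRange* range = GetNextQueuedRange();
        if (range == nullptr) {
          // No queued work. If nobody will ever submit more ranges, exit
          // rather than looping forever (the spinning this patch fixes).
          if (remaining_scan_range_submissions_.load() == 0) return;
          WaitForWork();
          continue;
        }
        ProcessRange(range);
      }
    }

   private:
    // Starts at 1 for the single IssueInitialRanges() call; sequence-file
    // scanners, which may add ranges later, manage extra increments
    // themselves (per base-sequence-scanner.cc in the patch).
    std::atomic<int> remaining_scan_range_submissions_{1};

    // Hypothetical helpers standing in for the real queueing/waiting code.
    ScanRange* GetNextQueuedRange();
    void WaitForWork();
    void ProcessRange(ScanRange* range);
  };

The essential property is that the counter is only ever decremented at points
where it is certain that no further submissions can happen, which is what the
earlier per-file bookkeeping failed to guarantee.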
I added a metric for how many times the scanners run through their loop without
doing any work and observed it to be non-zero for a query from
tests/query_test/test_runtime_filters.py:test_wait_time.

To measure the effect of this, I set up a cluster of 9 impalads and 1
coordinator, running against an entirely remote HDFS. The machines were
r4.4xlarge and the remote disks were EBS st1 volumes, though everything was
likely buffer cached. I ran TPC-DS Q1 with RUNTIME_FILTER_WAIT_TIME_MS=2000
against tpcds_1000_decimal_parquet 10 times. The most visible change is that
ScannerThreadsSysTime went from 5.6 seconds per query to 1.9 seconds per query.
(I ran the text profiles through the old-fashioned:

  grep ScannerThreadsSysTime profiles | \
    awk '/ms/ { x += $3/1000 } /ns/ { x += $3/1000000 } END { print x }'

) The effect on query time was quite small (the fastest run was 3.373s with the
change and 3.82s without it, and the averages were closer together), but the
extra work was clearly visible in the profiles.

In passing, I renamed HdfsScanNode::file_type_counts_ to
HdfsScanNode::file_type_counts_lock_, because HdfsScanNodeBase::file_type_counts_
also exists and is something entirely different.

This bug was co-debugged by Todd Lipcon, Joe McDonnell, and Philip Zeyliger.

Change-Id: I133de13238d3d05c510e2ff771d48979125735b1
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M tests/query_test/test_runtime_filters.py
8 files changed, 82 insertions(+), 50 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/12097/8

--
To view, visit http://gerrit.cloudera.org:8080/12097
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I133de13238d3d05c510e2ff771d48979125735b1
Gerrit-Change-Number: 12097
Gerrit-PatchSet: 8
Gerrit-Owner: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>