Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/19289
Change subject: IMPALA-11751: Template tuple of Avro header should be transferred to ScanRangeSharedState ...................................................................... IMPALA-11751: Template tuple of Avro header should be transferred to ScanRangeSharedState Sequence container based file formats (SequenceFile, RCFile, Avro) have a file header in each file that describes the metadata of the file, e.g. codec, default values, etc. The header should be decoded before reading the file content. The initial scanners will read the header and then issue follow-up scan ranges for the file content. The decoded header will be referenced by follow-up scanners. Since IMPALA-9655, when MT_DOP > 1, the issued scan ranges could be scheduled to other scan node instances. So the header resource should live until all scan node instances close. Header objects are owned by the object pool of the RuntimeState, which meets the requirement. AvroFileHeader is special than other headers in that it references a template tuple which contains the partition values and default values for missing fields. The template tuple is initially owned by the header scanner, then transferred to the scan node before the scanner closes. However, when the scan node instance closes, the template tuple is freed. Scanners of other scan node instances might still depend on it. This could cause wrong results or crash the impalad. When partition columns are used in the query, or when the underlying avro files have missing fields and the table schema has default values for them, the AvroFileHeader will have a non-null template tuple, which could hit this bug when MT_DOP>1. This patch fixes the bug by transferring the template tuple to ScanRangeSharedState directly. The scan_node_pool of HdfsScanNodeBase is also removed since it's only used to hold the template tuple (and related buffers) of the avro header. Also no need to override TransferToScanNodePool in HdfsScanNode since the original purpose is to protect the pool by a lock, and now the method in ScanRangeSharedState already has a lock. Tests - Add missing test coverage for compute stats on avro tables. Note that MT_DOP=4 is set by default for compute stats. - Add the MT_DOP dimension for TestScannersAllTableFormats. Also add some queries that can reveal the bug in scanners.test. The ASAN build can easily crash by heap-use-after-free error with this fix. - Ran related tests locally. Change-Id: Iafa43fce7c2ffdc867004d11e5873327c3d8cb42 --- M be/src/exec/avro/hdfs-avro-scanner.cc M be/src/exec/base-sequence-scanner.cc M be/src/exec/grouping-aggregator.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scan-node.cc M be/src/exec/hdfs-scan-node.h M be/src/exec/streaming-aggregation-node.h M testdata/workloads/functional-query/queries/QueryTest/compute-stats-avro.test M testdata/workloads/functional-query/queries/QueryTest/scanners.test M tests/query_test/test_scanners.py 11 files changed, 227 insertions(+), 23 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/89/19289/1 -- To view, visit http://gerrit.cloudera.org:8080/19289 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Iafa43fce7c2ffdc867004d11e5873327c3d8cb42 Gerrit-Change-Number: 19289 Gerrit-PatchSet: 1 Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>