Hello Daniel Becker, Gabor Kaszab, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/21190

to look at the new patch set (#7).

Change subject: IMPALA-12894: (part 2) Fix optimized count(*) for Iceberg 
tables with dangling delete files
......................................................................

IMPALA-12894: (part 2) Fix optimized count(*) for Iceberg tables with dangling 
delete files

Impala can return incorrect results if a table has dangling delete
files. Dangling delete files are delete files that are part of the
snapshot but do not apply to any of its data files. Such delete files
can be left behind by Spark's rewrite_data_files action.

During analysis we check the existence of delete files based on the
snapshot summary. If there are no delete files in the table, we just
replace the count(*) expression with NumericLiteral($record_count).
If there are delete files in the table (based on the summary), we set
optimize_count_star_for_iceberg_v2 in the query context.

Without optimize_count_star_for_iceberg_v2 in the query context, the
IcebergScanPlanner would create the following plan.

    AGGREGATE
    COUNT(*)
        |
    UNION ALL
   /         \
  /           \
 /             \
SCAN all    ANTI JOIN
datafiles  /         \
without   /           \
deletes  SCAN         SCAN
         datafiles    deletes
         with deletes

With optimize_count_star_for_iceberg_v2 the final plan looks like
the following:

      ArithmeticExpr(ADD)
      /             \
     /               \
    /                 \
record_count       AGGREGATE
of all             COUNT(*)
datafiles              |
without            ANTI JOIN
deletes           /         \
                 /           \
                SCAN        SCAN
                datafiles   deletes
                with deletes

The ArithmeticExpr(ADD) and its left child (record_count) are created
by the analyzer; IcebergScanPlanner is responsible for creating the
plan under AGGREGATE COUNT(*). If the table has delete files and
optimize_count_star_for_iceberg_v2 is true, the planner knows it can
omit the original UNION ALL and its left child.

However, IcebergScanPlanner checks delete file existence based on the
result of planFiles(), which filters out dangling delete files.
If no delete files remain, IcebergScanPlanner assumes that the
case is already handled by the analyzer (i.e. it replaced count(*)
with NumericLiteral($record_count)). So it incorrectly creates a
normal SCAN plan of the table under COUNT(*), i.e. we end up
with this:

      ArithmeticExpr(ADD)
      /             \
     /               \
    /                 \
record_count       AGGREGATE
of all             COUNT(*)
datafiles              |
without              SCAN
deletes            datafiles
                   without
                   deletes

This means Impala returns $record_count * 2 instead of $record_count.
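
The double counting can be reproduced with a toy model of the plans
above. This is only an illustrative sketch, not Impala code: DataFile,
correct_count() and buggy_count() are hypothetical names, and each file
simply carries the number of rows removed by applicable delete files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    record_count: int      # rows in the file according to table metadata
    deleted_rows: int = 0  # rows removed by applicable deletes (0 = none apply)


def stats_part(files):
    # Left child of ArithmeticExpr(ADD): record_count of files without deletes.
    return sum(f.record_count for f in files if f.deleted_rows == 0)


def correct_count(files):
    # Right child: COUNT(*) over the ANTI JOIN of files that do have deletes.
    scan_part = sum(f.record_count - f.deleted_rows
                    for f in files if f.deleted_rows > 0)
    return stats_part(files) + scan_part


def buggy_count(files):
    with_deletes = [f for f in files if f.deleted_rows > 0]
    if with_deletes:
        scan_part = sum(f.record_count - f.deleted_rows for f in with_deletes)
    else:
        # Modeled bug: no applicable deletes are found, so a plain SCAN of
        # all data files is planned under COUNT(*) although the ADD
        # expression already contains their record_count.
        scan_part = sum(f.record_count for f in files)
    return stats_part(files) + scan_part


# All deletes are dangling, so no file has applicable deletes.
only_dangling = [DataFile(100), DataFile(50)]
print(correct_count(only_dangling), buggy_count(only_dangling))
```

When at least one delete file applies, both functions agree; they only
diverge when every delete file is dangling.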

This patch fixes the FeIcebergTable.hasDeleteFiles() method so that it
also ignores dangling delete files. Therefore, the analyzer simply
substitutes count(*) with NumericLiteral($record_count) if all deletes
are dangling, i.e. there is no need to involve the IcebergScanPlanner
at all.
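
The intended behavior can be sketched as follows. This is a simplified
illustration under stated assumptions, not the actual implementation:
DeleteFile, has_delete_files() and plan_count_star() are hypothetical
names, and a delete file is treated as applicable only if it references
a data file that is still live. Real Iceberg applicability rules are
more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeleteFile:
    path: str
    # Assumption: paths of the data files this delete file targets.
    referenced_data_files: frozenset


def has_delete_files(live_data_files, delete_files):
    """Return True only if some delete file applies to a live data file."""
    live = set(live_data_files)
    return any(d.referenced_data_files & live for d in delete_files)


def plan_count_star(total_record_count, live_data_files, delete_files):
    # Analyzer-side decision: with no applicable deletes, count(*) becomes a
    # literal taken from the statistics and no scan is planned.
    if not has_delete_files(live_data_files, delete_files):
        return ("NumericLiteral", total_record_count)
    # Otherwise the stats + ANTI JOIN plan is needed (details omitted here).
    return ("ArithmeticExpr(ADD)", None)


live = ["a.parquet", "b.parquet"]
dangling = DeleteFile("d1.parquet", frozenset({"rewritten-away.parquet"}))
print(plan_count_star(150, live, [dangling]))
```

With only the dangling delete file present, the sketch picks the literal
path, which mirrors the case the fix is meant to cover.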

The patch also introduces a new query option,
"iceberg_disable_count_star_optimization", so users can completely
disable the statistics-based count(*) optimization if necessary.

Testing:
 * e2e tests
 * planner tests

Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables-hash-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes-orc.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test
11 files changed, 336 insertions(+), 433 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/90/21190/7
--
To view, visit http://gerrit.cloudera.org:8080/21190
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
