Hello Philip Zeyliger, Bikramjeet Vig, Impala Public Jenkins, Vuk Ercegovac,
I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/11528 to look at the new patch set (#13). Change subject: IMPALA-7310: Use NDV=1 for a Column with all nulls ...................................................................... IMPALA-7310: Use NDV=1 for a Column with all nulls Modified the planner to handle low-value NDVs by adjusting them upward by one to account for null values. Thus, even an all-null column, which has an NDV of 0 in stats, will have an NDV of 1 in the planner. (The planner already expects NDV to include nulls.) Modified the front end to allow capturing the full plan for use in a unit test. Added unit tests that verify estimated cardinality for a plan as a way to verify that the fix solved the scenario in IMPALA-7310. Testing required a new table, similar to the existing nulltable, but which has multiple rows and has stats calculated. The change was limited to a very narrow range of cases: * Base table column (not an internal column such as COUNT(*)) * Type is not BOOLEAN (turns out metadata does the needed NDV correction for BOOLEAN only.) * Column has stats * Column is nullable * Column does not provide a null count, or null count > 0 * Reported NDV <= 1 Testing showed that, at least for the functional test tables, we do have cases in which stats are computed, but the null count is -1 (undefined), which is why null count had to be considered. If we know the null count, and the null count is zero, then no adjustment is needed, But, if we don't know the null count, or it is positive, then adjustment may be needed. Research for this patch revealed that Impala treats NDVs in two distinct ways: * Stats (which use the NDV function) computes NDV as the number of distinct non-null values. (That is, the NDV of (0, null) is 1.) * The planner itself when working with constants, uses a definition of NDV that includes nulls. That is, the NDV of (0, null) is 2. This fix attempts to bridge the two definitions, Since we know that the NDV in stats excludes nulls (except for the BOOLEAN type), and we know that the column contains nulls, we can bump up the NDV to convert from the stats definition to the planner definition. But, to avoid regressions, we do so in a very narrow range of NDV values: only 0 and 1. Technically, the adjustment should apply to all NDV values. However, it turns out that if we do so, we end up with many failures in PlannerTest in those tests that work with TPC-H tables. The TPC-H tables have multiple columns marked as nullable but which actually have no nulls. Some of these columns also have a low NDV. By limiting the NDV adjustment to the narrow range, the TPC-H tests need not be updated. Since end users could have a similar situation, the narrow range reduces the chance that this fix might impact such workloads. Although the change minimized impact on PlannerTest, some memory numbers needed adjusting for a test in which one column hit the criteria listed above and had its NDV adjusted. Bug: IMPALA-7310: All-null columns give wrong estimates in planner Change-Id: Ife657a43c9cafc451bd12ddf857dcb7169e97459 --- M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/service/Frontend.java M fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java A fe/src/test/java/org/apache/impala/planner/CardinalityTest.java A testdata/NullRows/data.csv M testdata/bin/compute-table-stats.sh M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv 9 files changed, 455 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/28/11528/13 -- To view, visit http://gerrit.cloudera.org:8080/11528 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ife657a43c9cafc451bd12ddf857dcb7169e97459 Gerrit-Change-Number: 11528 Gerrit-PatchSet: 13 Gerrit-Owner: Paul Rogers <par0...@yahoo.com> Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Paul Rogers <par0...@yahoo.com> Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com> Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>