[Impala-ASF-CR] IMPALA-7310: All-null columns give wrong estimates in planner

Paul Rogers (Code Review) Fri, 28 Sep 2018 16:25:31 -0700

Paul Rogers has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11528 )


Change subject: IMPALA-7310: All-null columns give wrong estimates in planner
......................................................................


Patch Set 8:

(32 comments)

Thanks for the reviews. Here are my responses thus far. Most are of the form 
"did it." Vuk, I had a couple of questions on your comments.

Thanks,

- Paul

http://gerrit.cloudera.org:8080/#/c/11528/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11528/2//COMMIT_MSG@7
PS2, Line 7: IMPALA-7310: All-null columns give wrong estimates in planner
> We tend to have a single line subject with a newline after it. That ends up
Other projects require the JIRA ticket number in the commit message. Do we? 
Such projects tend to favor the JIRA ticket title (or a cleaned up version) as 
the description. Do we? In these projects, the assumption is that the patch is 
to fix the listed ticket, so no need to include the word "Fix" in the patch 
title.

Happy to remove the ticket number and change the title to "Correct NDV 
estimates for all-null columns."

I pushed an update with the newline which shows up in my view of the patch.

Did that revision replace the one here on which you commented? Is there a 
better way to revise the commit message?


http://gerrit.cloudera.org:8080/#/c/11528/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11528/8//COMMIT_MSG@24
PS8, Line 24: * Table column (not an internal column such as COUNT(*))
> Base table?
Ack


http://gerrit.cloudera.org:8080/#/c/11528/8//COMMIT_MSG@27
PS8, Line 27: not provide a null count
> when does this happen when stats are computed?
See revised message. I saw multiple cases in the test DBs in which stats were 
computed, but null count was -1 (undefined), so I needed to handle that case in 
order to get things to work.


http://gerrit.cloudera.org:8080/#/c/11528/8//COMMIT_MSG@28
PS8, Line 28: NDV <= 1
> if its unknown whether there are nulls or not (from the prev. line), could 
Added to the commit message:

Research for this patch revealed that Impala treats NDVs in two distinct ways:  
                                       

* Stats (which use the NDV function) computes NDV as the number of distinct 
non-null values. (That is, the NDV of (0, null) is 1.
* The planner itself when working with constants, uses a definition of NDV that 
includes nulls. That is, the NDV of (0, null) is 2.

This fix attempts to bridge the two definitions, Since we know that the NDV in 
stats excludes nulls (except for the BOOLEAN type), and we know that the column 
contains nulls, we can bump up the NDV to convert from the stats definition to 
the planner definition. But, to avoid regressions, we do so in a very narrow 
range of NDV values: only 0 and 1.


http://gerrit.cloudera.org:8080/#/c/11528/8//COMMIT_MSG@31
PS8, Line 31: Any larger adjustment
> confused... if we're adjusting for nulls, and we're treating a null as a si
Reworded:

Technically, the adjustment should apply to all NDV values. However, it turns 
out that if we do so, we end up with many failures in PlannerTest in those 
tests that work with TPC-H tables. The TPC-H tables have multiple columns 
marked as nullable but which actually have no nulls. Some of these columns also 
have a low NDV.                      

By limiting the NDV adjustment to the narrow range, the TPC-H tests need not be 
updated.  Since end users could have a similar situation, the narrow range 
reduces the chance that this fix might impact such workloads.


http://gerrit.cloudera.org:8080/#/c/11528/2/.gitignore
File .gitignore:

http://gerrit.cloudera.org:8080/#/c/11528/2/.gitignore@48
PS2, Line 48:
> I'm not a stickler, but you could separate the .gitignore stuff into a sepa
Reverted all changes to this file.


http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@111
PS8, Line 111:   public ArrayList<Expr> getGroupList() { return groupingExprs_; 
}
> where is this used in this change?
Was used in a test, but that test ended up not being include as part of this 
patch, so removed the line.


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/analysis/SlotRef.java
File fe/src/main/java/org/apache/impala/analysis/SlotRef.java:

http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@102
PS2, Line 102:   private void computeNdv() {
> line has trailing whitespace
Ack


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@105
PS2, Line 105:     // Potentially adjust NDV for nulls if the column is has 
stats,
> This used to execute even when hasStats() was false. What's numDistinctValu
Defaults to -1.

Put it back as it was.


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@115
PS2, Line 115:     if (desc_.getStats().hasStats()) {
> Comment about why getPath()?
Ack


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@152
PS2, Line 152:     }
             :
             :     computeNdv();
             :
             :     // If this column is for a base table, limit NDV to the 
number
             :     /
> It seems a little weird that numDistinctValues_ is modified both in compute
Yes. As it turns out, there are two paths to build one of these things. One 
path has a base table, and get analyzed, the other does not. If we look 
carefully, it turns out that the resolved path is available only here, not in 
the constructor for that other path.

Wanted to avoid duplicating adjustment, so factored it out, but did not want to 
move more than necessary.


http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/main/java/org/apache/impala/analysis/SlotRef.java
File fe/src/main/java/org/apache/impala/analysis/SlotRef.java:

http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@41
PS8, Line 41: add value
> nit: lots of 'value' in this comment. perhaps replace this with "make sense
Ack


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java
File fe/src/main/java/org/apache/impala/service/Frontend.java:

http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java@185
PS2, Line 185:   public static class PlanCtx {
> This line is extra. I saw a similar one somewhere else; we don't typically
Ack


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java@1043
PS2, Line 1043:     if (planCtx.planCaptureRequested()) {
> I think "planCtx.shouldCapturePlan()" reads slightly better. Or perhaps pla
Ack


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java@1092
PS2, Line 1092:     parser.setQueryOptions(options);
> Are there still users of this one?
Yes. Several in tests, and another in "jniFrontend". Didn't want to expand 
scope by changing those also.


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java@1116
PS2, Line 1116:       result.setTimeline(timeline.toThrift());
> Perhaps getQueryContext() would be clearer here?
Ack


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java@1142
PS2, Line 1142:                 attempt, INCONSISTENT_METADATA_NUM_RETRIES)
> line too long (108 > 90)
Ack


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java:

http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java@132
PS2, Line 132:     verifyNdv("id * int_col", 7300);
> mention there JIRA where you filed this as a bug?
Ack


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java@141
PS2, Line 141:
> There's code in the frontend that does this, no? Is there a test for that c
Since expressions are computed incorrectly, there is no case where we an exceed 
row count now. If we did calcs correctly (multiply out NDVs for *; etc.) then 
we'd need to test this.

To avoid discussion of hypothetical improvements, simply removed the comment.


http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java:

http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@180
PS2, Line 180:     planCtx.requestPlanCapture();
> line too long (93 > 90)
Ack


http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java:

http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@31
PS6, Line 31:     verifyCardinality("SELECT id FROM functional.alltypes", 7300);
> some of these have semicolons and some don't; perhaps remove all of them?
Ack


http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@153
PS6, Line 153:         "SELECT COUNT(*)" + joinClause + "GROUP BY t1.id", 7300);
> consider renaming to "expectCardinality(query, expected)" which helps make
Ack


http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@156
PS6, Line 156:   }
> Maybe add the query string to the message here?
Ack


http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@168
PS6, Line 168:         expected, planRoot.getCardinality());
> planCtx.setCapturePlan(true)?
Ack


http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@172
PS6, Line 172:     // Set up the query context. Note that we need to deep copy 
it before
> nit: remove
Ack


http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@175
PS6, Line 175:         "default", System.getProperty("user.name"));
> Why bother? The test will fail with the exception if it's thrown.
Sure. Just copying another test here. I suppose the argument is that, without 
this try/catch block, every test will throw the ImpalaException. I usually do 
that as a way of saying that the exception is unexpected, but I wondered if 
Impala had conventions for this kind of thing.


http://gerrit.cloudera.org:8080/#/c/11528/7/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java:

http://gerrit.cloudera.org:8080/#/c/11528/7/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@111
PS7, Line 111:
> line too long (91 > 90)
Ack


http://gerrit.cloudera.org:8080/#/c/11528/7/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@143
PS7, Line 143:   /**
> line too long (96 > 90)
Ack


http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java:

http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@22
PS8, Line 22: private static final boolean DEBUG_MODE = false;
> use a logger.
This was more to help write the test, no need to keep it. Removed the code.


http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@190
PS8, Line 190: if (DEBUG_MODE) {
             :       System.out.println(plan.get(0).getExplainString(
             :           queryOptions, TExplainLevel.EXTENDED));
             :     }
> use a logger
Ack


http://gerrit.cloudera.org:8080/#/c/11528/8/testdata/NullTable/large_data.csv
File testdata/NullTable/large_data.csv:

http://gerrit.cloudera.org:8080/#/c/11528/8/testdata/NullTable/large_data.csv@1
PS8, Line 1: a,,\N,\N,\N,a
> why's this called "large_data"?
Because the directory already contains a "data.csv" file. Couldn't justify 
creating another directory, given that the tables have the same schema, but 
different amounts of data. So, compared to a single-line version, this 26-line 
version is "large."

Should I instead create a separate directory with its own "data.csv" file?


http://gerrit.cloudera.org:8080/#/c/11528/8/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/11528/8/testdata/datasets/functional/functional_schema_template.sql@1355
PS8, Line 1355: ---- CREATE_KUDU
> where's this used for kudu tests? fwict, the refs to this new table are exp
Sorry, just copied the one-line version of this file. Can't claim I understand 
the syntax of this file, and could find no documentation to help.

Which is more important: having the same schema in both Kudu and HDFS, or only 
including tables that are actually referenced, even if that means the schemas 
differ across systems?



--
To view, visit http://gerrit.cloudera.org:8080/11528
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ife657a43c9cafc451bd12ddf857dcb7169e97459
Gerrit-Change-Number: 11528
Gerrit-PatchSet: 8
Gerrit-Owner: Paul Rogers <par0...@yahoo.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <par0...@yahoo.com>
Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 Sep 2018 23:24:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7310: All-null columns give wrong estimates in planner

Reply via email to