Paul Rogers has posted comments on this change. ( http://gerrit.cloudera.org:8080/11528 )
Change subject: IMPALA-7310: All-null columns give wrong estimates in planner ...................................................................... Patch Set 8: (32 comments) Thanks for the reviews. Here are my responses thus far. Most are of the form "did it." Vuk, I had a couple of questions on your comments. Thanks, - Paul http://gerrit.cloudera.org:8080/#/c/11528/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11528/2//COMMIT_MSG@7 PS2, Line 7: IMPALA-7310: All-null columns give wrong estimates in planner > We tend to have a single line subject with a newline after it. That ends up Other projects require the JIRA ticket number in the commit message. Do we? Such projects tend to favor the JIRA ticket title (or a cleaned up version) as the description. Do we? In these projects, the assumption is that the patch is to fix the listed ticket, so no need to include the word "Fix" in the patch title. Happy to remove the ticket number and change the title to "Correct NDV estimates for all-null columns." I pushed an update with the newline which shows up in my view of the patch. Did that revision replace the one here on which you commented? Is there a better way to revise the commit message? http://gerrit.cloudera.org:8080/#/c/11528/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11528/8//COMMIT_MSG@24 PS8, Line 24: * Table column (not an internal column such as COUNT(*)) > Base table? Ack http://gerrit.cloudera.org:8080/#/c/11528/8//COMMIT_MSG@27 PS8, Line 27: not provide a null count > when does this happen when stats are computed? See revised message. I saw multiple cases in the test DBs in which stats were computed, but null count was -1 (undefined), so I needed to handle that case in order to get things to work. http://gerrit.cloudera.org:8080/#/c/11528/8//COMMIT_MSG@28 PS8, Line 28: NDV <= 1 > if its unknown whether there are nulls or not (from the prev. line), could Added to the commit message: Research for this patch revealed that Impala treats NDVs in two distinct ways: * Stats (which use the NDV function) computes NDV as the number of distinct non-null values. (That is, the NDV of (0, null) is 1. * The planner itself when working with constants, uses a definition of NDV that includes nulls. That is, the NDV of (0, null) is 2. This fix attempts to bridge the two definitions, Since we know that the NDV in stats excludes nulls (except for the BOOLEAN type), and we know that the column contains nulls, we can bump up the NDV to convert from the stats definition to the planner definition. But, to avoid regressions, we do so in a very narrow range of NDV values: only 0 and 1. http://gerrit.cloudera.org:8080/#/c/11528/8//COMMIT_MSG@31 PS8, Line 31: Any larger adjustment > confused... if we're adjusting for nulls, and we're treating a null as a si Reworded: Technically, the adjustment should apply to all NDV values. However, it turns out that if we do so, we end up with many failures in PlannerTest in those tests that work with TPC-H tables. The TPC-H tables have multiple columns marked as nullable but which actually have no nulls. Some of these columns also have a low NDV. By limiting the NDV adjustment to the narrow range, the TPC-H tests need not be updated. Since end users could have a similar situation, the narrow range reduces the chance that this fix might impact such workloads. http://gerrit.cloudera.org:8080/#/c/11528/2/.gitignore File .gitignore: http://gerrit.cloudera.org:8080/#/c/11528/2/.gitignore@48 PS2, Line 48: > I'm not a stickler, but you could separate the .gitignore stuff into a sepa Reverted all changes to this file. http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java: http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@111 PS8, Line 111: public ArrayList<Expr> getGroupList() { return groupingExprs_; } > where is this used in this change? Was used in a test, but that test ended up not being include as part of this patch, so removed the line. http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/analysis/SlotRef.java File fe/src/main/java/org/apache/impala/analysis/SlotRef.java: http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@102 PS2, Line 102: private void computeNdv() { > line has trailing whitespace Ack http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@105 PS2, Line 105: // Potentially adjust NDV for nulls if the column is has stats, > This used to execute even when hasStats() was false. What's numDistinctValu Defaults to -1. Put it back as it was. http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@115 PS2, Line 115: if (desc_.getStats().hasStats()) { > Comment about why getPath()? Ack http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@152 PS2, Line 152: } : : computeNdv(); : : // If this column is for a base table, limit NDV to the number : / > It seems a little weird that numDistinctValues_ is modified both in compute Yes. As it turns out, there are two paths to build one of these things. One path has a base table, and get analyzed, the other does not. If we look carefully, it turns out that the resolved path is available only here, not in the constructor for that other path. Wanted to avoid duplicating adjustment, so factored it out, but did not want to move more than necessary. http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/main/java/org/apache/impala/analysis/SlotRef.java File fe/src/main/java/org/apache/impala/analysis/SlotRef.java: http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@41 PS8, Line 41: add value > nit: lots of 'value' in this comment. perhaps replace this with "make sense Ack http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java File fe/src/main/java/org/apache/impala/service/Frontend.java: http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java@185 PS2, Line 185: public static class PlanCtx { > This line is extra. I saw a similar one somewhere else; we don't typically Ack http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java@1043 PS2, Line 1043: if (planCtx.planCaptureRequested()) { > I think "planCtx.shouldCapturePlan()" reads slightly better. Or perhaps pla Ack http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java@1092 PS2, Line 1092: parser.setQueryOptions(options); > Are there still users of this one? Yes. Several in tests, and another in "jniFrontend". Didn't want to expand scope by changing those also. http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java@1116 PS2, Line 1116: result.setTimeline(timeline.toThrift()); > Perhaps getQueryContext() would be clearer here? Ack http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/main/java/org/apache/impala/service/Frontend.java@1142 PS2, Line 1142: attempt, INCONSISTENT_METADATA_NUM_RETRIES) > line too long (108 > 90) Ack http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java File fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java: http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java@132 PS2, Line 132: verifyNdv("id * int_col", 7300); > mention there JIRA where you filed this as a bug? Ack http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java@141 PS2, Line 141: > There's code in the frontend that does this, no? Is there a test for that c Since expressions are computed incorrectly, there is no case where we an exceed row count now. If we did calcs correctly (multiply out NDVs for *; etc.) then we'd need to test this. To avoid discussion of hypothetical improvements, simply removed the comment. http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java: http://gerrit.cloudera.org:8080/#/c/11528/2/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@180 PS2, Line 180: planCtx.requestPlanCapture(); > line too long (93 > 90) Ack http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java: http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@31 PS6, Line 31: verifyCardinality("SELECT id FROM functional.alltypes", 7300); > some of these have semicolons and some don't; perhaps remove all of them? Ack http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@153 PS6, Line 153: "SELECT COUNT(*)" + joinClause + "GROUP BY t1.id", 7300); > consider renaming to "expectCardinality(query, expected)" which helps make Ack http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@156 PS6, Line 156: } > Maybe add the query string to the message here? Ack http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@168 PS6, Line 168: expected, planRoot.getCardinality()); > planCtx.setCapturePlan(true)? Ack http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@172 PS6, Line 172: // Set up the query context. Note that we need to deep copy it before > nit: remove Ack http://gerrit.cloudera.org:8080/#/c/11528/6/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@175 PS6, Line 175: "default", System.getProperty("user.name")); > Why bother? The test will fail with the exception if it's thrown. Sure. Just copying another test here. I suppose the argument is that, without this try/catch block, every test will throw the ImpalaException. I usually do that as a way of saying that the exception is unexpected, but I wondered if Impala had conventions for this kind of thing. http://gerrit.cloudera.org:8080/#/c/11528/7/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java: http://gerrit.cloudera.org:8080/#/c/11528/7/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@111 PS7, Line 111: > line too long (91 > 90) Ack http://gerrit.cloudera.org:8080/#/c/11528/7/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@143 PS7, Line 143: /** > line too long (96 > 90) Ack http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java: http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@22 PS8, Line 22: private static final boolean DEBUG_MODE = false; > use a logger. This was more to help write the test, no need to keep it. Removed the code. http://gerrit.cloudera.org:8080/#/c/11528/8/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@190 PS8, Line 190: if (DEBUG_MODE) { : System.out.println(plan.get(0).getExplainString( : queryOptions, TExplainLevel.EXTENDED)); : } > use a logger Ack http://gerrit.cloudera.org:8080/#/c/11528/8/testdata/NullTable/large_data.csv File testdata/NullTable/large_data.csv: http://gerrit.cloudera.org:8080/#/c/11528/8/testdata/NullTable/large_data.csv@1 PS8, Line 1: a,,\N,\N,\N,a > why's this called "large_data"? Because the directory already contains a "data.csv" file. Couldn't justify creating another directory, given that the tables have the same schema, but different amounts of data. So, compared to a single-line version, this 26-line version is "large." Should I instead create a separate directory with its own "data.csv" file? http://gerrit.cloudera.org:8080/#/c/11528/8/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/11528/8/testdata/datasets/functional/functional_schema_template.sql@1355 PS8, Line 1355: ---- CREATE_KUDU > where's this used for kudu tests? fwict, the refs to this new table are exp Sorry, just copied the one-line version of this file. Can't claim I understand the syntax of this file, and could find no documentation to help. Which is more important: having the same schema in both Kudu and HDFS, or only including tables that are actually referenced, even if that means the schemas differ across systems? -- To view, visit http://gerrit.cloudera.org:8080/11528 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ife657a43c9cafc451bd12ddf857dcb7169e97459 Gerrit-Change-Number: 11528 Gerrit-PatchSet: 8 Gerrit-Owner: Paul Rogers <par0...@yahoo.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Paul Rogers <par0...@yahoo.com> Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com> Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com> Gerrit-Comment-Date: Fri, 28 Sep 2018 23:24:53 +0000 Gerrit-HasComments: Yes