zabetak commented on code in PR #4442:
URL: https://github.com/apache/hive/pull/4442#discussion_r1317011777

##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -621,6 +644,80 @@ private ASTNode pkFkHint(int fkTableIndex, boolean nonFkSideIsFiltered) {
     }
   }
 
+  private ASTNode createASTLateralView(TableFunctionScan tfs, Schema s,

Review Comment:
   If it's static, let's declare it explicitly.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -383,8 +386,16 @@ private ASTNode convert() throws CalciteSemanticException {
       ASTBuilder sel = ASTBuilder.construct(HiveParser.TOK_SELEXPR, "TOK_SELEXPR");
       ASTNode function = buildUDTFAST(call.getOperator().getName(), children);
       sel.add(function);
-      for (String alias : udtf.getRowType().getFieldNames()) {
-        sel.add(HiveParser.Identifier, alias);
+
+      // When we are treating a UDTF as a 'select', it is not a lateral view.
+      // In this case, it means that only the fields from the UDTF are selected
+      // out of the RelNode and placed into the SELEXPR. So we want to ignore
+      // any field in the inputRef mapping.
+      List<String> fields = udtf.getRowType().getFieldNames();
+      for (int i = 0; i < udtf.getRowType().getFieldCount(); ++i) {
+        if (!udtf.containsInputRefMapping(i)) {

Review Comment:
   With the newly introduced `lateral` operator, this `if` check and the related code checking for containment may be redundant. Can we get here with more fields than necessary?



##########
ql/src/test/results/clientnegative/udf_assert_true2.q.out:
##########
@@ -36,7 +36,7 @@ STAGE PLANS:
         Limit
           Number of rows: 2
           Select Operator
-            expressions: (1 + assert_true((_col6 < 2))) (type: int)
+            expressions: (1 + UDFToInteger(assert_true((_col6 < 2)))) (type: int)

Review Comment:
   What caused this change? It looks a bit weird.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterTableFunctionTransposeRule.java:
##########
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.metadata.RelColumnMapping;
+import org.apache.calcite.rex.RexCall;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.rex.RexUtil;
+import org.apache.calcite.tools.RelBuilder;
+import org.apache.calcite.tools.RelBuilderFactory;
+import org.apache.commons.collections4.CollectionUtils;
+import org.apache.hadoop.hive.ql.exec.FunctionRegistry;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveFilter;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveTableFunctionScan;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ImmutableList;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Rule to transpose Filter and TableFunctionScan RelNodes.
+ *
+ * This differs from the FilterTableFunctionTransposeRule for a number of reasons:
+ * - We check if the filter condition is deterministic under Hive rules. We cannot
+ *   push filters that aren't deterministic.
+ * - We check if the filter only references columns that are not part of the udtf.
+ *   If the input ref in the filter references a column that is generated by the
+ *   udtf, the filter cannot be pushed through.

Review Comment:
   Thanks for the analysis. Please do the following:
   1. Log a CALCITE ticket reporting the problem with `Logical` operators in `FilterTableFunctionTransposeRule`.
   2. Add an entry in `org.apache.hadoop.hive.ql.optimizer.calcite.Bug`.
   3. Reference the new entry somewhere in the rule, similarly to how the entries are referenced elsewhere (see the sketch after this list).
   4. Update the Javadoc since, as far as I understand, this is the main reason why we cannot use the respective Calcite rule.
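   For steps 2 and 3, here is a rough sketch of the usual pattern, modeled on the existing entries in `Bug`; `XXXX` is a placeholder until the CALCITE issue is actually filed, and the constant name and wording are assumptions:

   ```java
   // Hypothetical addition to org.apache.hadoop.hive.ql.optimizer.calcite.Bug;
   // XXXX stands in for the CALCITE ticket number once the issue is filed.

   /**
    * Whether CALCITE-XXXX ("FilterTableFunctionTransposeRule creates Logical
    * operators") is fixed.
    */
   public static final boolean CALCITE_XXXX_FIXED = false;
   ```

   The reference in the new rule could then be a Javadoc line along the lines of "Remove this rule in favor of Calcite's `FilterTableFunctionTransposeRule` once `Bug.CALCITE_XXXX_FIXED` is true."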

##########
ql/src/test/queries/clientpositive/lateral_view.q:
##########
@@ -10,6 +10,11 @@ EXPLAIN SELECT myTable.* FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS
 EXPLAIN SELECT myTable.myCol, myTable2.myCol2 FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol LATERAL VIEW explode(array('a', 'b', 'c')) myTable2 AS myCol2 LIMIT 9;
 EXPLAIN SELECT myTable2.* FROM src LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2 LIMIT 3;
 
+EXPLAIN CBO SELECT * FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol SORT BY key ASC, myCol ASC LIMIT 1;

Review Comment:
   If the goal of adding `EXPLAIN CBO` to every `LATERAL VIEW` query was to ensure that CBO doesn't fail, then maybe we could enforce this differently (e.g., `set hive.cbo.fallback.strategy=never`).



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -621,6 +644,80 @@ private ASTNode pkFkHint(int fkTableIndex, boolean nonFkSideIsFiltered) {
     }
   }
 
+  private ASTNode createASTLateralView(TableFunctionScan tfs, Schema s,
+      QueryBlockInfo tableFunctionSource, String sqAlias) {
+    // The structure of the AST LATERAL VIEW will be:
+    //
+    // TOK_LATERAL_VIEW
+    //   TOK_SELECT
+    //     TOK_SELEXPR
+    //       TOK_FUNCTION
+    //         <udtf func>
+    //         ...
+    //       <col alias for function>
+    //       TOK_TABALIAS
+    //         <table alias for lateral view>
+
+    // set up the select for the parameters of the UDTF
+    List<ASTNode> children = new ArrayList<>();
+    // The UDTF function call within the table function scan will be of the form:
+    // lateral(my_udtf_func(...), $0, $1, ...). For recreating the AST, we need
+    // the inner "my_udtf_func".
+    RexCall lateralCall = (RexCall) tfs.getCall();
+    RexCall call = (RexCall) lateralCall.getOperands().get(0);
+    for (RexNode rn : call.getOperands()) {
+      ASTNode expr = rn.accept(new RexVisitor(s, rn instanceof RexLiteral,
+          select.getCluster().getRexBuilder()));
+      children.add(expr);
+    }
+    ASTNode function = buildUDTFAST(call.getOperator().getName(), children);
+
+    // Add the function to the SELEXPR
+    ASTBuilder selexpr = ASTBuilder.construct(HiveParser.TOK_SELEXPR, "TOK_SELEXPR");
+    selexpr.add(function);
+
+    // Add only the table generated side columns to the select expr for the function,
+    // skipping over the base table columns from the input side of the join.
+    int i = 0;
+    for (ColumnInfo c : s) {
+      if (i++ < tableFunctionSource.schema.size()) {
+        continue;
+      }
+      selexpr.add(HiveParser.Identifier, c.column);
+    }
+    // add the table alias for the lateral view.
+    ASTBuilder tabAlias = ASTBuilder.construct(HiveParser.TOK_TABALIAS, "TOK_TABALIAS");
+    tabAlias.add(HiveParser.Identifier, sqAlias);
+
+    // add the table alias to the SEL_EXPR
+    selexpr.add(tabAlias.node());
+
+    // create the SELECT clause
+    ASTBuilder sel = ASTBuilder.construct(HiveParser.TOK_SELEXPR, "TOK_SELECT");
+    sel.add(selexpr.node());
+
+    // place the SELECT clause under the LATERAL VIEW clause
+    ASTBuilder lateralview = ASTBuilder.construct(HiveParser.TOK_LATERAL_VIEW, "TOK_LATERAL_VIEW");
+    lateralview.add(sel.node());
+
+    // finally, add the LATERAL VIEW clause under the left side source which is the base table.
+    lateralview.add(tableFunctionSource.ast);
+
+    return lateralview.node();
+  }
+
+  /**
+   * Check to see if we can optimize out the lateral view operators.
+   * We do not need to use the lateral view syntax if all of the fields
+   * selected out of the table scan come from the UDTF call. No join
+   * is needed because all the fields come from the table level rather
+   * than the row level.
+   */

Review Comment:
   The comment seems obsolete since the method does not check any fields.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -577,6 +588,18 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
       ast = ASTBuilder.subQuery(left, sqAlias);
       s = new Schema((Union) r, sqAlias);
     }
+    } else if (r instanceof HiveTableFunctionScan &&

Review Comment:
   We can put the `instanceof` check inside the `isLateralView` method and simplify the code here to be simply `isLateralView(r)`.
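   Roughly what I have in mind, as a sketch only; identifying the wrapper by the operator name `lateral` is an assumption based on the `lateral(my_udtf_func(...), $0, $1, ...)` shape described in `createASTLateralView`, and the real check may compare against the actual operator instance:

   ```java
   // Sketch: fold the instanceof check into the helper so the call site becomes
   // simply isLateralView(r). Checking for the TableFunctionScan super-class
   // also lines up with the cast performed just below the check.
   private static boolean isLateralView(RelNode r) {
     if (!(r instanceof TableFunctionScan)) {
       return false;
     }
     RexNode call = ((TableFunctionScan) r).getCall();
     return call instanceof RexCall
         && "lateral".equalsIgnoreCase(((RexCall) call).getOperator().getName());
   }
   ```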

##########
ql/src/test/results/clientpositive/llap/tablevalues.q.out:
##########
@@ -175,14 +175,27 @@ STAGE PLANS:
           TableScan
             alias: mytbl_n1
             Select Operator
-              expressions: array(struct(key,value,'A',10,key),struct(key,value,'B',20,key)) (type: array<struct<col1:string,col2:string,col3:string,col4:int,col5:string>>)
+              expressions: key (type: string)
               outputColumnNames: _col0
-              UDTF Operator
-                function name: inline
+              Lateral View Forward
                 Select Operator
-                  expressions: col3 (type: string), col4 (type: int), col5 (type: string)
-                  outputColumnNames: _col0, _col1, _col2
-                  ListSink
+                  Lateral View Join Operator
+                    outputColumnNames: _col2, _col3, _col4
+                    Select Operator
+                      expressions: _col2 (type: string), _col3 (type: int), _col4 (type: string)
+                      outputColumnNames: _col0, _col1, _col2
+                      ListSink
+                Select Operator
+                  expressions: array(struct('A',10,_col0),struct('B',20,_col0)) (type: array<struct<col1:string,col2:int,col3:string>>)
+                  outputColumnNames: _col0
+                  UDTF Operator
+                    function name: inline
+                    Lateral View Join Operator
+                      outputColumnNames: _col2, _col3, _col4
+                      Select Operator
+                        expressions: _col2 (type: string), _col3 (type: int), _col4 (type: string)
+                        outputColumnNames: _col0, _col1, _col2
+                        ListSink

Review Comment:
   This seems like a regression. Probably it was caused by the introduction of `lateral` in `HiveTableFunctionScan`. Do we need to revisit that choice?



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java:
##########
@@ -1024,30 +1022,8 @@ boolean isCBOExecuted() {
 
   @Override
   boolean isCBOSupportedLateralView(ASTNode lateralView) {
-    // Lateral view AST has the following shape:
-    // ^(TOK_LATERAL_VIEW
-    //   ^(TOK_SELECT ^(TOK_SELEXPR ^(TOK_FUNCTION Identifier params) identifier* tableAlias)))
-    if (lateralView.getToken().getType() == HiveParser.TOK_LATERAL_VIEW_OUTER) {
-      // LATERAL VIEW OUTER not supported in CBO
-      return false;
-    }
-    // Only INLINE followed by ARRAY supported in CBO
-    ASTNode lvFunc = (ASTNode) lateralView.getChild(0).getChild(0).getChild(0);
-    String lvFuncName = lvFunc.getChild(0).getText();
-    if (lvFuncName.compareToIgnoreCase(
-        GenericUDTFInline.class.getAnnotation(Description.class).name()) != 0) {
-      return false;
-    }
-    if (lvFunc.getChildCount() != 2) {
-      return false;
-    }

Review Comment:
   I believe the `if (lvFunc.getChildCount() != 2)` check was redundant anyway, since at this line we have already checked that the function is `GenericUDTFInline`, and by definition it cannot have more than one input. Since here we are reasoning at the AST level, I believe the first child corresponds to the function name token and the second child to the token representing the input argument of the inline function.
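   To spell out the arity argument (illustrative only, using the AST shape from the removed comment above):

   ```java
   // Once the name check has passed, the subtree rooted at lvFunc is known to be:
   //
   //   TOK_FUNCTION
   //     Identifier "inline"        <- child 0: the function name token
   //     TOK_FUNCTION "array" ...   <- child 1: the single input argument
   //
   // so lvFunc.getChildCount() != 2 can never be true and the check is dead code.
   ```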

##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveOptimizeInlineArrayTableFunctionRule.java:
##########
@@ -124,9 +125,12 @@ public void onMatch(RelOptRuleCall call) {
     RexNode newInlineCall = cluster.getRexBuilder().makeCall(tfs.getRowType(),
         inlineCall.op, newArrayCall);
 
+    // Use empty list for columnMappings. The return row type of the RelNode now
+    // comprises all the fields within the UDTF, so there is no mapping from the
+    // output fields directly to the input fields anymore.
     final RelNode newTableFunctionScanNode = tfs.copy(tfs.getTraitSet(),
         tfs.getInputs(), newInlineCall, tfs.getElementType(), tfs.getRowType(),
-        tfs.getColumnMappings());
+        Collections.emptySet());
 
     call.transformTo(newTableFunctionScanNode);

Review Comment:
   Does this rule still work as-is, now that we are wrapping the table function with the `LATERAL` operator? Do we want/need to add special cases if we have lateral + inline array?
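   If the rule does need to look through the wrapper, the unwrap could mirror what `createASTLateralView` already does; a rough sketch, where identifying the wrapper by operator name is an assumption:

   ```java
   // Sketch: if the scan's call is now lateral(inline(array(...)), $0, ...),
   // the rule has to match on the inner inline call, not the top-level operator.
   RexCall topCall = (RexCall) tfs.getCall();
   RexCall inlineCall = "lateral".equalsIgnoreCase(topCall.getOperator().getName())
       ? (RexCall) topCall.getOperands().get(0)  // step inside the LATERAL wrapper
       : topCall;                                // unwrapped, pre-existing shape
   ```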

##########
ql/src/java/org/apache/hadoop/hive/ql/ppd/OpProcFactory.java:
##########
@@ -504,27 +504,32 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
       // We try to push the full Filter predicate iff:
       // - the Filter is on top of a TableScan, or
      // - the Filter is on top of a PTF (between PTF and Filter, there might be Select operators)
+      // - the Filter is on top of a LateralViewJoinOperator (the filter can be pushed through one
+      //   side of the join with the base table predicate, but not the UDTF side.)

Review Comment:
   I don't fully understand this optimization. The comment says that we want to fully push the filter if it is on top of a LateralViewJoinOperator, but then it seems that we have restrictions on which side of the join we are going to push the filter. Are these restrictions enforced somewhere later on?



##########
ql/src/test/queries/clientpositive/lateral_view.q:
##########
@@ -10,6 +10,11 @@ EXPLAIN SELECT myTable.* FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS
 EXPLAIN SELECT myTable.myCol, myTable2.myCol2 FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol LATERAL VIEW explode(array('a', 'b', 'c')) myTable2 AS myCol2 LIMIT 9;
 EXPLAIN SELECT myTable2.* FROM src LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2 LIMIT 3;
 
+EXPLAIN CBO SELECT * FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol SORT BY key ASC, myCol ASC LIMIT 1;

Review Comment:
   I suggested creating a new .q file (`cbo_lateral_views.q`) and adding test cases there, to put more focus on what we really want to test as part of this change. We want to verify that the CBO plan for certain types of queries is the expected one. Determining the types of queries is the whole point of having a dedicated .q file.

   **Examples**

   Do we care about the LIMIT clause and the actual number there? Probably not, so we don't need to add EXPLAIN CBO for that one.

   Do we care about how many times we have `LATERAL VIEW explode(array(1,2,3))` in the query? Probably not; one or two should suffice.

   ```sql
   --Q1
   SELECT myTable.myCol FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol;
   --Q2
   SELECT myTable.myCol, myTable2.myCol2 FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol LATERAL VIEW explode(array(1,2,3)) myTable2 AS myCol2;
   --Q3
   SELECT myTable.myCol, myTable2.myCol2, myTable3.myCol3 FROM src LATERAL VIEW explode(array(1,2,3)) myTable AS myCol LATERAL VIEW explode(array(1,2,3)) myTable2 AS myCol2 LATERAL VIEW explode(array(1,2,3)) myTable3 AS myCol3;
   ```

   If the CBO plan is correct for Q1 and Q2, there is no reason for it to be wrong for Q3, Q4, etc.

   Do we care about what the actual table function inside the `LATERAL VIEW` clause is? Probably not; we don't need to test every possible variation of `explode`, `inline`, etc.

   Do we want our tables to have data, and thus use `src` and other similar tables? Probably not.

   All that to say that having many EXPLAIN CBO statements does not necessarily increase test coverage, and it has some drawbacks as well. Picking representative queries and putting them in a dedicated file seems like a better alternative.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java:
##########
@@ -577,6 +588,18 @@ private QueryBlockInfo convertSource(RelNode r) throws CalciteSemanticException
       ast = ASTBuilder.subQuery(left, sqAlias);
       s = new Schema((Union) r, sqAlias);
     }
+    } else if (r instanceof HiveTableFunctionScan &&

Review Comment:
   Maybe it suffices to check for the super-class type `TableFunctionScan`, since it is the one used in the cast just below.