soumyakanti3578 commented on code in PR #5196:
URL: https://github.com/apache/hive/pull/5196#discussion_r1638520483
##########
iceberg/iceberg-handler/src/test/results/positive/delete_iceberg_mixed.q.out:
##########
@@ -84,17 +84,17 @@ Stage-4
       <-Reducer 2 [CONTAINS]
         File Output Operator [FS_46]
           table:{"name:":"default.ice01"}
-          Select Operator [SEL_44] (rows=3 width=295)
+          Select Operator [SEL_44] (rows=3 width=266)
            Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
-           Merge Join Operator [MERGEJOIN_43] (rows=3 width=295)
+           Merge Join Operator [MERGEJOIN_43] (rows=3 width=266)
             Conds:RS_57._col4=RS_63._col0(Left Semi),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
           <-Map 1 [SIMPLE_EDGE] vectorized
             SHUFFLE [RS_57]
               PartitionCols:_col4
-              Select Operator [SEL_55] (rows=5 width=295)
+              Select Operator [SEL_55] (rows=6 width=280)

Review Comment:
It looks like there should be 6 rows instead of 5. Here's the insert statement: `insert into ice01 values (1, 'ABC'),(2, 'CBS'),(3, null),(4, 'POPI'),(5, 'AQWR'),(6, 'POIU'),(9, null),(8, 'POIKL'),(10, 'YUIO');` and the delete predicate is `where id>4 OR id=2`, which matches 6 rows, with ids 2, 5, 6, 9, 8, and 10. Also, I believe row counts are just estimated based on stats? I could be wrong though.

##########
packaging/pom.xml:
##########
@@ -173,6 +173,10 @@
           \Qhttps://www.eclipse.org/org/documents/epl-v10.php\E
           \Qhttp://www.eclipse.org/org/documents/epl-v10.php\E
         </EPL-1.0>
+        <EPL-2.0>
+          \Qhttps://raw.githubusercontent.com/locationtech/jts/master/LICENSE_EPLv2.txt\E
+          \Qhttps://www.eclipse.org/org/documents/epl-2.0/EPL-2.0.txt\E
+        </EPL-2.0>

Review Comment:
Without this, the build fails with:
`21:54:57 [ERROR] Failed to execute goal org.codehaus.mojo:license-maven-plugin:2.3.0:download-licenses (license-fetch) on project hive-packaging: Execution license-fetch of goal org.codehaus.mojo:license-maven-plugin:2.3.0:download-licenses failed: URL 'https://raw.githubusercontent.com/locationtech/jts/master/LICENSE_EPLv2.txt' should belong to licenseUrlFileName having key 'eclipse-public-license-v.-2.0-epl-2.0.txt' together with URLs 'https://www.eclipse.org/org/documents/epl-2.0/EPL-2.0.txt' -> [Help 1]`
I'm not sure if there's another way to resolve this.

##########
ql/src/java/org/apache/hadoop/hive/ql/parse/type/HiveFunctionHelper.java:
##########
@@ -322,18 +324,24 @@ public RexNode getExpression(String functionText, FunctionInfo fi,
         // unix_timestamp(args) -> to_unix_timestamp(args)
         calciteOp = HiveToUnixTimestampSqlOperator.INSTANCE;
       }
-      expr = rexBuilder.makeCall(returnType, calciteOp, inputs);
+      expr = getExpression(returnType, calciteOp, inputs);
     }
 
     if (expr instanceof RexCall && !expr.isA(SqlKind.CAST)) {
       RexCall call = (RexCall) expr;
       expr = rexBuilder.makeCall(returnType, call.getOperator(),
           RexUtil.flatten(call.getOperands(), call.getOperator()));
     }
-
     return expr;
   }
 
+  private RexNode getExpression(RelDataType returnType, SqlOperator calciteOp, List<RexNode> inputs) {
+    if (Objects.requireNonNull(calciteOp.getKind()) == SqlKind.IN) {

Review Comment:
I think the non-null check was added by mistake, and it should work without it. I will remove it.

##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSortPredicates.java:
##########
@@ -198,6 +203,16 @@ public Double visitCall(RexCall call) {
         return null;
       }
 
+      if (call.getKind() == SqlKind.SEARCH) {
+        RexCall expandedCall = (RexCall) RexUtil.expandSearch(
+            new RexBuilder(new JavaTypeFactoryImpl(new HiveTypeSystemImpl())),

Review Comment:
I will do this.
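For context, here is a rough standalone sketch of the direction: expand the SEARCH call with a `RexBuilder` supplied by the caller (for example the one already available on the cluster) instead of constructing a fresh one inside the visitor. The class and method names below are made up for illustration and are not part of the patch:

```java
import org.apache.calcite.rex.RexBuilder;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.rex.RexUtil;
import org.apache.calcite.sql.SqlKind;

final class SearchExpansionSketch {

  private SearchExpansionSketch() {
  }

  /**
   * Expands a SEARCH call back into the equivalent tree of comparisons so
   * that visitors written before Calcite introduced Sarg/SEARCH keep working.
   * The RexBuilder is provided by the caller rather than built per call.
   */
  static RexNode expandIfSearch(RexBuilder rexBuilder, RexNode node) {
    if (node.getKind() == SqlKind.SEARCH) {
      // RexUtil.expandSearch rewrites SEARCH($x, Sarg[...]) into the
      // corresponding OR/AND of point and range comparisons.
      return RexUtil.expandSearch(rexBuilder, null, node);
    }
    return node;
  }
}
```

The visitor could then recurse into the expanded result the same way the current patch does with `expandedCall`.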
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePointLookupOptimizerRule.java:
##########
@@ -179,7 +178,7 @@ protected HivePointLookupOptimizerRule(
     this.minNumORClauses = minNumORClauses;
   }
 
-  public RexNode analyzeRexNode(RexBuilder rexBuilder, RexNode condition) {
+  public static RexNode analyzeRexNode(RexBuilder rexBuilder, RexNode condition, int minNumORClauses) {

Review Comment:
Within `HivePointLookupOptimizerRule#analyzeRexNode` we try to merge IN expressions and create NOT/BETWEEN where possible. Right now, with Calcite 1.33, there is no other way to create these expressions, because both IN and BETWEEN are replaced with SEARCH. In a few places in this PR we therefore convert SEARCH back to IN and BETWEEN. Whenever `sarg.isPoints()` returns false, we can use `analyzeRexNode` to convert back to BETWEEN or NOT BETWEEN; for that I had to make it static. Here's a snippet from `ASTConverter` as an example:
```
      case SEARCH:
        ASTNode astNode = call.getOperands().get(0).accept(this);
        astNodeLst.add(astNode);
        RexLiteral literal = (RexLiteral) call.operands.get(1);
        Sarg<?> sarg = Objects.requireNonNull(literal.getValueAs(Sarg.class), "Sarg");
        int minOrClauses = SessionState.getSessionConf().getIntVar(HiveConf.ConfVars.HIVE_POINT_LOOKUP_OPTIMIZER_MIN);
        // convert Sarg to IN when they are points.
        if (sarg.isPoints()) {
          // just expand SEARCH to ORs when point count is less than HIVE_POINT_LOOKUP_OPTIMIZER_MIN
          if (sarg.pointCount < minOrClauses) {
            return visitCall((RexCall) call.accept(RexUtil.searchShuttle(rexBuilder, null, -1)));
          }
          // else convert to IN
          for (Range<?> range : sarg.rangeSet.asRanges()) {
            astNodeLst.add(visitLiteral((RexLiteral) rexBuilder.makeLiteral(
                range.lowerEndpoint(), literal.getType(), true, true)));
          }
          ASTNode inAST = SqlFunctionConverter.buildAST(HiveIn.INSTANCE, astNodeLst, call.getType());
          if (sarg.nullAs == RexUnknownAs.UNKNOWN) {
            return inAST;
          } else if (sarg.nullAs == RexUnknownAs.TRUE) {
            ASTNode isNull = visitCall((RexCall) rexBuilder.makeCall(SqlStdOperatorTable.IS_NULL, call.getOperands().get(0)));
            return SqlFunctionConverter.buildAST(SqlStdOperatorTable.OR, Arrays.asList(isNull, inAST), call.getType());
          } else {
            ASTNode isNotNull = visitCall((RexCall) rexBuilder.makeCall(SqlStdOperatorTable.IS_NOT_NULL, call.getOperands().get(0)));
            return SqlFunctionConverter
                .buildAST(SqlStdOperatorTable.AND, Arrays.asList(isNotNull, inAST), call.getType());
          }
        // Expand SEARCH operator
        } else {
          return visitCall((RexCall) HivePointLookupOptimizerRule
              .analyzeRexNode(
                  rexBuilder,
                  call.accept(RexUtil.searchShuttle(rexBuilder, null, -1)),
                  minOrClauses
              )
          );
        }
```

##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java:
##########
@@ -656,7 +658,9 @@ public static ImmutableList<RexNode> getPredsNotPushedAlready(Set<String> predic
     }
     // Build map to not convert multiple times, further remove already included predicates
     Map<String,RexNode> stringToRexNode = Maps.newLinkedHashMap();
+    RexSimplify simplify = new RexSimplify(inp.getCluster().getRexBuilder(), RelOptPredicateList.EMPTY, RexUtil.EXECUTOR);
     for (RexNode r : predsToPushDown) {
+      r = simplify.simplify(r);

Review Comment:
Without this we run into a StackOverflowError: https://github.com/apache/hive/pull/5196/commits/7f97e5aea3fd3370350a29606853c1c381792a4c
I checked again and we still hit the issue without it. Maybe we need to think of an alternate solution, given that I also had to exclude `HivePreFilteringRule` from a test because of a stack overflow.
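To make the effect of the added simplification concrete, here is a small self-contained example (the class name `SimplifyDemo` and the literal values are made up) showing how `RexSimplify` collapses a wide OR of equalities into a single SEARCH call, which keeps the predicate tree shallow for whatever visits it afterwards:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

import org.apache.calcite.jdbc.JavaTypeFactoryImpl;
import org.apache.calcite.plan.RelOptPredicateList;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rex.RexBuilder;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.rex.RexSimplify;
import org.apache.calcite.rex.RexUtil;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.sql.type.SqlTypeName;

public class SimplifyDemo {
  public static void main(String[] args) {
    RexBuilder rexBuilder = new RexBuilder(new JavaTypeFactoryImpl());
    RelDataType intType = rexBuilder.getTypeFactory().createSqlType(SqlTypeName.INTEGER);
    RexNode ref = rexBuilder.makeInputRef(intType, 0);

    // Build $0 = 1 OR $0 = 2 OR ... OR $0 = 50, i.e. a wide OR of point predicates.
    List<RexNode> equalities = new ArrayList<>();
    for (int i = 1; i <= 50; i++) {
      equalities.add(rexBuilder.makeCall(SqlStdOperatorTable.EQUALS, ref,
          rexBuilder.makeExactLiteral(BigDecimal.valueOf(i), intType)));
    }
    RexNode wideOr = rexBuilder.makeCall(SqlStdOperatorTable.OR, equalities);

    // RexSimplify folds the OR into one SEARCH($0, Sarg[...]) call, so later
    // visitors see a single shallow node instead of a deep OR tree.
    RexSimplify simplify = new RexSimplify(rexBuilder, RelOptPredicateList.EMPTY, RexUtil.EXECUTOR);
    System.out.println(simplify.simplify(wideOr));
  }
}
```

This is only an illustration of what the simplification does to the predicates, not a claim about the exact cause of the stack overflow above.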
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/jdbc/HiveJdbcImplementor.java:
##########
@@ -128,6 +129,11 @@ public HiveJdbcImplementor(SqlDialect dialect, JavaTypeFactory typeFactory) {
     return result(join, leftResult, rightResult);
   }
 
+  public Result visit(JdbcTableScan scan) {
+    return result(scan.jdbcTable.tableName(),
+        ImmutableList.of(Clause.FROM), scan, null);
+  }
+

Review Comment:
This is required because `visit(JdbcTableScan scan)` was removed from Calcite in https://issues.apache.org/jira/browse/CALCITE-4640 (see https://github.com/apache/calcite/pull/2426/files#diff-f118416cd59a4ba4e96ffcfa0d2b496dd2b372c6f5be01f7bb4d46996b9175d1). We get a lot of errors without it.

##########
ql/src/test/queries/clientpositive/cardinality_preserving_join_opt2.q:
##########
@@ -2,6 +2,7 @@ set hive.support.concurrency=true;
 set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
 set hive.strict.checks.cartesian.product=false;
 set hive.stats.fetch.column.stats=true;
+set hive.cbo.rule.exclusion.regex=HivePreFilteringRule;

Review Comment:
No, this should definitely not be excluded in production. Unfortunately, without this exclusion the test runs for a long time and throws a StackOverflowError. I excluded the rule just so that we get past this error and get a green run. If possible, please look into this test failure, as I wasn't able to resolve it: http://ci.hive.apache.org/job/hive-precommit/job/PR-5196/27/testReport/junit/org.apache.hadoop.hive.cli.split3/TestMiniLlapLocalCliDriver/Testing___split_22___PostProcess___testCliDriver_cardinality_preserving_join_opt2_/

##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -508,8 +492,19 @@ public void testComputeRangePredicateSelectivityBetweenWithNULLS() {
   public void testComputeRangePredicateSelectivityNotBetweenWithNULLS() {
     doReturn((double) 20).when(tableMock).getRowCount();
     doReturn(Collections.singletonList(stats)).when(tableMock).getColStat(Collections.singletonList(0));
-    RexNode filter = REX_BUILDER.makeCall(SqlStdOperatorTable.BETWEEN, boolTrue, inputRef0, int1, int3);
+    RexNode filter = makeNotBetween(inputRef0, int1, int3);
     FilterSelectivityEstimator estimator = new FilterSelectivityEstimator(scan, mq);
     Assert.assertEquals(0.55, estimator.estimateSelectivity(filter), DELTA);
   }
+
+  private RexNode makeNotBetween(RexNode inputRef, RexNode left, RexNode right) {
+    RexNode withOr = REX_BUILDER.makeCall(
+        SqlStdOperatorTable.OR,
+        REX_BUILDER.makeCall(SqlStdOperatorTable.LESS_THAN, inputRef, left),
+        REX_BUILDER.makeCall(SqlStdOperatorTable.GREATER_THAN, inputRef, right)
+    );
+    RexSimplify simplify = new RexSimplify(REX_BUILDER, RelOptPredicateList.EMPTY, RexUtil.EXECUTOR);
+
+    return simplify.simplify(withOr);

Review Comment:
With what you have suggested, it will create a RexCall like `NOT(SEARCH...)`, which goes to `computeNotEqualitySelectivity` in `FilterSelectivityEstimator`. maxNDV is computed there, but 1.0 is always returned when the value is less than 1. This gives a wrong result even if we add an else-if block in `getMaxNDV`:
```
else if (op.isA(SqlKind.SEARCH)) {
  tmpNDV = op.accept(this);
  if (tmpNDV == null) {
    return null;
  }
  if (tmpNDV > maxNDV) {
    maxNDV = tmpNDV;
  }
} else {
  irv = new InputReferencedVisitor();
  irv.apply(op);
  ...
  ...
```
With my patch, it creates a RexCall with the SEARCH operator, without the NOT, and it directly matches the `case SEARCH` branch.
Moreover, we cannot create a BETWEEN with `makeBetween` when `left > right`, so this test would fail with what you've suggested:
```
public void testComputeRangePredicateSelectivityNotBetweenRightLowerThanLeft() {
  RexNode filter = makeNotBetween(inputRef0, int5, int3);
  FilterSelectivityEstimator estimator = new FilterSelectivityEstimator(scan, mq);
  Assert.assertEquals(1, estimator.estimateSelectivity(filter), DELTA);
}
```
Let me know what you think about this.

##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -171,7 +176,36 @@ public Double visitCall(RexCall call) {
         break;
       }
 
-      default:
+      case SEARCH:
+        RexLiteral literal = (RexLiteral) call.operands.get(1);
+        Sarg<?> sarg = Objects.requireNonNull(literal.getValueAs(Sarg.class), "Sarg");
+        RexBuilder rexBuilder = new RexBuilder(new JavaTypeFactoryImpl(new HiveTypeSystemImpl()));

Review Comment:
I will change this.

##########
ql/src/test/org/apache/hadoop/hive/ql/io/TestSymlinkTextInputFormat.java:
##########
@@ -157,7 +157,7 @@ public void testCombine() throws Exception {
     drv.run(loadFileCommand);
 
-    String cmd = "select key*1 from " + tblName;
+    String cmd = "select key*2 from " + tblName;

Review Comment:
I saw that the final logical plan with Calcite 1.33 is a bit more reduced than what was produced with 1.25:
```
1.33 - select key*1 ...
DEBUG [main] translator.PlanModifierForASTConv: Final plan after modifier
HiveProject(_c0=[$0])
  HiveTableScan(table=[[default, text_symlink_text]], table:alias=[text_symlink_text])

1.25 - select key*1 ...
DEBUG [main] translator.PlanModifierForASTConv: Final plan after modifier
HiveProject(_c0=[*($0, 1)])
  HiveTableScan(table=[[default, text_symlink_text]], table:alias=[text_symlink_text])
```
I think `key * 1` is getting reduced to just `key` in 1.33. If we change it to `key*2`, the test passes:
```
1.33 - select key*2 ...
DEBUG [main] translator.PlanModifierForASTConv: Final plan after modifier
HiveProject(_c0=[*($0, 2)])
  HiveTableScan(table=[[default, text_symlink_text]], table:alias=[text_symlink_text])
```
Since this test exercises the symlink text input format and is not specifically about `key * 1`, I changed it to `key * 2`. Moreover, in 1.25, when we run the same test with `select key ...`, it fails with the same error.

##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveFilter.java:
##########
@@ -72,6 +78,22 @@ public HiveFilter(RelOptCluster cluster, RelTraitSet traits, RelNode child, RexN
     super(cluster, TraitsUtil.getDefaultTraitSet(cluster), child, condition);
     this.correlationInfos = new CorrelationInfoSupplier(getCondition());
   }
+
+  @Override
+  public RelWriter explainTerms(RelWriter pw) {
+    // Remove this method after upgrading Calcite to 1.35+

Review Comment:
Would it be a good idea to write the project properties with `properties-maven-plugin` to `ql/src/main/resources/`, and read the Calcite version from there to compare? This plugin is already in use in `itests`. If we just define a constant CALCITE_VERSION = 1.35, nothing will throw an exception when the actual version is higher than the constant, but if we read the version directly from the generated properties file, it will.
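To make the proposal a bit more concrete, here is a rough sketch of the reading/comparing side; the resource name `calcite-version.properties`, the property key `calcite.version`, and the class name are all placeholders for whatever the `properties-maven-plugin` configuration would actually write, not existing code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class CalciteVersionGuard {

  private CalciteVersionGuard() {
  }

  /**
   * Fails once the build moves to (or past) the given Calcite version, as a
   * reminder to delete workaround code such as HiveFilter#explainTerms.
   */
  static void assertCalciteOlderThan(String ceilingVersion) throws IOException {
    Properties props = new Properties();
    try (InputStream in = CalciteVersionGuard.class
        .getResourceAsStream("/calcite-version.properties")) { // placeholder resource name
      if (in == null) {
        return; // properties file not generated; skip the check
      }
      props.load(in);
    }
    String actual = props.getProperty("calcite.version"); // placeholder key
    if (actual != null && compareVersions(actual, ceilingVersion) >= 0) {
      throw new IllegalStateException("Calcite is now " + actual
          + "; the workaround guarded by this check should be removed.");
    }
  }

  /** Compares dotted numeric versions such as "1.33" and "1.35". */
  private static int compareVersions(String a, String b) {
    String[] as = a.split("\\.");
    String[] bs = b.split("\\.");
    for (int i = 0; i < Math.max(as.length, bs.length); i++) {
      int ai = i < as.length ? numericPrefix(as[i]) : 0;
      int bi = i < bs.length ? numericPrefix(bs[i]) : 0;
      if (ai != bi) {
        return Integer.compare(ai, bi);
      }
    }
    return 0;
  }

  /** Parses the leading digits of a version segment, e.g. "35-SNAPSHOT" -> 35. */
  private static int numericPrefix(String segment) {
    int end = 0;
    while (end < segment.length() && Character.isDigit(segment.charAt(end))) {
      end++;
    }
    return end == 0 ? 0 : Integer.parseInt(segment.substring(0, end));
  }
}
```

For example, a unit test could call `assertCalciteOlderThan("1.35")` so the reminder fires automatically once the upgrade lands.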
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org