zabetak commented on code in PR #5505:
URL: https://github.com/apache/hive/pull/5505#discussion_r1918056949
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSortTransposeRule.java:
##########
@@ -22,31 +22,36 @@
import org.apache.calcite.rel.RelNode;
import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveFilter;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortExchange;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
import com.google.common.collect.ImmutableList;
-public class HiveFilterSortTransposeRule extends RelOptRule {
+import static java.util.Collections.singletonList;
- public static final HiveFilterSortTransposeRule INSTANCE =
- new HiveFilterSortTransposeRule();
+public class HiveFilterSortTransposeRule&lt;T extends RelNode&gt; extends RelOptRule {
+
+  public static final HiveFilterSortTransposeRule&lt;HiveSortLimit&gt; SORT_LIMIT_INSTANCE =
+      new HiveFilterSortTransposeRule&lt;&gt;(HiveSortLimit.class);
+  public static final HiveFilterSortTransposeRule&lt;HiveSortExchange&gt; SORT_EXCHANGE_INSTANCE =
+      new HiveFilterSortTransposeRule&lt;&gt;(HiveSortExchange.class);
//~ Constructors -----------------------------------------------------------
/**
* Creates a HiveFilterSortTransposeRule.
*/
- private HiveFilterSortTransposeRule() {
+ private HiveFilterSortTransposeRule(Class<T> clazz) {
Review Comment:
It doesn't seem necessary to make the class generic. We could simply declare
the argument as `Class<? extends RelNode>` with the same type safety guarantees
as now.
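To illustrate the suggestion with plain Java (using hypothetical stand-in classes, not the actual Calcite/Hive types), a bounded wildcard on the constructor parameter enforces the same compile-time bound as the class-level type parameter:

```java
// Stand-in classes for illustration only; RelNodeStub is a hypothetical
// placeholder for Calcite's RelNode, not the real class.
class RelNodeStub {}
class SortLimitStub extends RelNodeStub {}

// Generic variant, as in the PR: T never appears in any field or other
// member, so the type parameter only adds verbosity at declaration sites.
class GenericRule<T extends RelNodeStub> {
  GenericRule(Class<T> clazz) { /* e.g. operand(clazz, ...) */ }
}

// Non-generic variant suggested by the comment: a bounded wildcard gives
// the same compile-time guarantee that only RelNodeStub subclasses pass.
class PlainRule {
  PlainRule(Class<? extends RelNodeStub> clazz) { /* e.g. operand(clazz, ...) */ }
}

public class WildcardVsGeneric {
  public static void main(String[] args) {
    GenericRule<SortLimitStub> generic = new GenericRule<>(SortLimitStub.class);
    PlainRule plain = new PlainRule(SortLimitStub.class);
    // new PlainRule(String.class);  // would not compile: String is not a RelNodeStub
    System.out.println("both constructed OK");
  }
}
```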
##########
ql/src/test/queries/clientpositive/reducesink_dedup.q:
##########
@@ -1,3 +1,4 @@
+-- SORT_QUERY_RESULTS
--! qt:dataset:part
--! qt:dataset:src
select p_name
Review Comment:
The results of this query haven't changed since 2019, so I am wondering why they are changing now. Did the underlying plan change?
##########
ql/src/test/queries/clientpositive/distributeby.q:
##########
@@ -0,0 +1,24 @@
+create table t1 (a string, b int);
+
+insert into t1 values ('2014-03-14 10:10:12', 10);
+
+-- distribute by
+explain cbo
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
+explain
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
Review Comment:
It seems that for this query we verify the results, but not for the ones that follow. I would think there are already tests that cover the correctness/results of these clauses. Do we need new ones?
##########
ql/src/test/results/clientpositive/llap/mapreduce5.q.out:
##########
@@ -45,7 +45,7 @@ STAGE PLANS:
alias: src
                  Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
-                   expressions: key (type: string), UDFToInteger((key / 10)) (type: int), UDFToInteger((key % 10)) (type: int), value (type: string)
+                   expressions: key (type: string), UDFToInteger((UDFToDouble(key) / 10.0D)) (type: int), UDFToInteger((UDFToDouble(key) % 10.0D)) (type: int), value (type: string)
Review Comment:
I see various places where `UDFToDouble` is introduced. I guess the type casting rules are different with and without CBO; this is also something that may cause result changes after the PR gets in.
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java:
##########
@@ -995,12 +995,6 @@ private static String canHandleQbForCbo(QueryProperties queryProperties,
       HiveConf conf, boolean topLevelQB) {
List<String> reasons = new ArrayList<>();
// Not ok to run CBO, build error message.
- if (queryProperties.hasClusterBy()) {
- reasons.add("has cluster by");
- }
- if (queryProperties.hasDistributeBy()) {
- reasons.add("has distribute by");
- }
if (queryProperties.hasSortBy() && queryProperties.hasLimit()) {
reasons.add("has sort by with limit");
}
Review Comment:
What does `sort by` with `limit` mean? Why can't we handle it?
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java:
##########
@@ -5611,4 +5365,282 @@ private enum TableType {
JDBC
}
+ private class OrderByRelBuilder {
Review Comment:
Since a lot of code has been refactored and moved as part of this change, it may be a good opportunity to extract this refactored class completely out of `CalcitePlanner`, if that is possible (keeping it package private).
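A minimal sketch of the kind of extraction the comment proposes, using hypothetical names: a non-static inner class carries an implicit reference to its enclosing `CalcitePlanner`, so moving it to a package-private top-level class means passing that dependency in explicitly.

```java
// Placeholder for whatever CalcitePlanner state the builder actually uses;
// purely illustrative, not a real Hive type.
class RelBuilderContext {}

// Package-private top-level class instead of a private inner class: the
// implicit outer-instance reference becomes an explicit constructor argument.
class OrderByRelBuilder {
  private final RelBuilderContext ctx;

  OrderByRelBuilder(RelBuilderContext ctx) {
    this.ctx = ctx;
  }

  boolean hasContext() {
    return ctx != null;
  }
}

public class ExtractInnerClassSketch {
  public static void main(String[] args) {
    OrderByRelBuilder builder = new OrderByRelBuilder(new RelBuilderContext());
    System.out.println(builder.hasContext());
  }
}
```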
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSortTransposeRule.java:
##########
@@ -22,31 +22,36 @@
import org.apache.calcite.rel.RelNode;
import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveFilter;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortExchange;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
import com.google.common.collect.ImmutableList;
-public class HiveFilterSortTransposeRule extends RelOptRule {
+import static java.util.Collections.singletonList;
- public static final HiveFilterSortTransposeRule INSTANCE =
- new HiveFilterSortTransposeRule();
+public class HiveFilterSortTransposeRule&lt;T extends RelNode&gt; extends RelOptRule {
Review Comment:
Adding generics does not seem to improve the type safety of the class or its
usages. On the negative side it makes rule declarations more verbose.
##########
ql/src/test/queries/clientpositive/distributeby.q:
##########
@@ -0,0 +1,24 @@
+create table t1 (a string, b int);
+
+insert into t1 values ('2014-03-14 10:10:12', 10);
+
+-- distribute by
+explain cbo
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
Review Comment:
Why is there a WHERE clause and `date_add` functions? Do we intend to test something more than just the `DISTRIBUTE BY` clause? If they are not relevant, please remove them. If they are relevant, explain why we don't need the same tests for the rest of the queries.
##########
ql/src/test/queries/clientpositive/distributeby_cboret.q:
##########
@@ -0,0 +1,26 @@
+set hive.cbo.returnpath.hiveop=true;
+
+create table t1 (a string, b int);
+
+insert into t1 values ('2014-03-14 10:10:12', 10);
+
+-- distribute by
+explain cbo
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
Review Comment:
I don't see the CBO plans in the .q.out file. Is this normal?
##########
ql/src/test/queries/clientpositive/distributeby.q:
##########
@@ -0,0 +1,24 @@
+create table t1 (a string, b int);
+
+insert into t1 values ('2014-03-14 10:10:12', 10);
+
+-- distribute by
+explain cbo
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
+explain
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
+
+-- distribute by and sort by
+explain cbo
+select * from t1 distribute by a, b sort by a;
Review Comment:
I think Hive allows `SORT BY` without `DISTRIBUTE BY`; can we also add an `EXPLAIN CBO` test for this case?
##########
ql/src/test/results/clientpositive/llap/semijoin_hint.q.out:
##########
@@ -535,14 +535,14 @@ STAGE PLANS:
keys:
0 cstring (type: string)
1 value (type: string)
- outputColumnNames: str
-                  Statistics: Num rows: 3068 Data size: 266916 Basic stats: COMPLETE Column stats: COMPLETE
+                  outputColumnNames: cstring, str
+                  Statistics: Num rows: 3068 Data size: 336852 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: str (type: string)
null sort order: z
sort order: +
- Map-reduce partition columns: str (type: string)
-                    Statistics: Num rows: 3068 Data size: 266916 Basic stats: COMPLETE Column stats: COMPLETE
+                    Map-reduce partition columns: cstring (type: string)
Review Comment:
Are we picking another column? Why?
##########
ql/src/test/queries/clientpositive/distributeby.q:
##########
@@ -0,0 +1,24 @@
+create table t1 (a string, b int);
Review Comment:
Since these tests are mainly for CBO changes, consider renaming the file to `cbo_distributeby.q`, or maybe `cbo_distribute_sort_cluster_by.q` since we don't have only the distribute clause.
##########
ql/src/test/results/clientpositive/llap/input_columnarserde.q.out:
##########
@@ -54,7 +54,7 @@ STAGE PLANS:
Reduce Output Operator
null sort order:
sort order:
- Map-reduce partition columns: 1 (type: int)
+ Map-reduce partition columns: _col0 (type: array<int>)
Review Comment:
This change alters the semantics of the plan and could affect the result set. Did we fix a bug?
What are the semantics of `DISTRIBUTE BY 1`? Does it mean distribute by the column in position 1, or distribute by the numeric constant 1?
##########
ql/src/test/results/clientpositive/llap/implicit_cast_during_insert.q.out:
##########
@@ -40,60 +40,64 @@ STAGE PLANS:
            Map Operator Tree:
                TableScan
                  alias: src
-                 filterExpr: (key) IN (0, 1) (type: boolean)
+                 filterExpr: (UDFToDouble(key)) IN (0.0D, 1.0D) (type: boolean)
                  Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
-                   predicate: (key) IN (0, 1) (type: boolean)
-                   Statistics: Num rows: 3 Data size: 534 Basic stats: COMPLETE Column stats: COMPLETE
+                   predicate: (UDFToDouble(key)) IN (0.0D, 1.0D) (type: boolean)
+                   Statistics: Num rows: 250 Data size: 44500 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: value (type: string), key (type: string)
                      outputColumnNames: _col1, _col2
-                     Statistics: Num rows: 3 Data size: 534 Basic stats: COMPLETE Column stats: COMPLETE
+                     Statistics: Num rows: 250 Data size: 44500 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col2 (type: string)
                        null sort order: z
                        sort order: +
                        Map-reduce partition columns: _col2 (type: string)
-                       Statistics: Num rows: 3 Data size: 534 Basic stats: COMPLETE Column stats: COMPLETE
+                       Statistics: Num rows: 250 Data size: 44500 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: string)
-           Execution mode: llap
+           Execution mode: vectorized, llap
            LLAP IO: all inputs
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: UDFToInteger(KEY.reducesinkkey0) (type: int), VALUE._col0 (type: string), KEY.reducesinkkey0 (type: string)
                outputColumnNames: _col0, _col1, _col2
-               Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE Column stats: COMPLETE
-               File Output Operator
-                 compressed: false
-                 Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE Column stats: COMPLETE
-                 table:
-                     input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
-                     output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
-                     serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
-                     name: default.implicit_cast_during_insert
+               Statistics: Num rows: 250 Data size: 45500 Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string)
                  outputColumnNames: c1, c2, p1
-                 Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE Column stats: COMPLETE
+                 Statistics: Num rows: 250 Data size: 45500 Basic stats: COMPLETE Column stats: COMPLETE
                  Group By Operator
                    aggregations: min(c1), max(c1), count(1), count(c1), compute_bit_vector_hll(c1), max(length(c2)), avg(COALESCE(length(c2),0)), count(c2), compute_bit_vector_hll(c2)
                    keys: p1 (type: string)
                    mode: complete
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
-                   Statistics: Num rows: 2 Data size: 838 Basic stats: COMPLETE Column stats: COMPLETE
+                   Statistics: Num rows: 250 Data size: 104750 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: 'LONG' (type: string), UDFToLong(_col1) (type: bigint), UDFToLong(_col2) (type: bigint), (_col3 - _col4) (type: bigint), COALESCE(ndv_compute_bit_vector(_col5),0) (type: bigint), _col5 (type: binary), 'STRING' (type: string), UDFToLong(COALESCE(_col6,0)) (type: bigint), COALESCE(_col7,0) (type: double), (_col3 - _col8) (type: bigint), COALESCE(ndv_compute_bit_vector(_col9),0) (type: bigint), _col9 (type: binary), _col0 (type: string)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
-                     Statistics: Num rows: 2 Data size: 1234 Basic stats: COMPLETE Column stats: COMPLETE
+                     Statistics: Num rows: 250 Data size: 154250 Basic stats: COMPLETE Column stats: COMPLETE
                      File Output Operator
                        compressed: false
-                       Statistics: Num rows: 2 Data size: 1234 Basic stats: COMPLETE Column stats: COMPLETE
+                       Statistics: Num rows: 250 Data size: 154250 Basic stats: COMPLETE Column stats: COMPLETE
                        table:
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+               Select Operator
+                 expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string)
+                 outputColumnNames: _col0, _col1, _col2
+                 File Output Operator
+                   compressed: false
+                   Dp Sort State: PARTITION_SORTED
Review Comment:
This `Dp Sort State` attribute appeared after this change. Is this a good or a bad thing?
##########
ql/src/test/queries/clientpositive/distributeby_cboret.q:
##########
@@ -0,0 +1,26 @@
+set hive.cbo.returnpath.hiveop=true;
Review Comment:
Most tests that use this property use the `cbo_rp_` prefix, so for consistency let's do that here as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]