zabetak commented on code in PR #5505:
URL: https://github.com/apache/hive/pull/5505#discussion_r1918056949
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSortTransposeRule.java:
##########
@@ -22,31 +22,36 @@
import org.apache.calcite.rel.RelNode;
import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveFilter;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortExchange;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
import com.google.common.collect.ImmutableList;
-public class HiveFilterSortTransposeRule extends RelOptRule {
+import static java.util.Collections.singletonList;
- public static final HiveFilterSortTransposeRule INSTANCE =
- new HiveFilterSortTransposeRule();
+public class HiveFilterSortTransposeRule&lt;T extends RelNode&gt; extends RelOptRule {
+
+  public static final HiveFilterSortTransposeRule&lt;HiveSortLimit&gt; SORT_LIMIT_INSTANCE =
+      new HiveFilterSortTransposeRule&lt;&gt;(HiveSortLimit.class);
+  public static final HiveFilterSortTransposeRule&lt;HiveSortExchange&gt; SORT_EXCHANGE_INSTANCE =
+      new HiveFilterSortTransposeRule&lt;&gt;(HiveSortExchange.class);
//~ Constructors -----------------------------------------------------------
/**
* Creates a HiveFilterSortTransposeRule.
*/
- private HiveFilterSortTransposeRule() {
+ private HiveFilterSortTransposeRule(Class<T> clazz) {
Review Comment:
It doesn't seem necessary to make the class generic. We could simply declare
the argument as `Class<? extends RelNode>` with the same type safety guarantees
as now.
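To illustrate the suggestion with plain Java (using hypothetical stand-in classes, not the actual Calcite/Hive types), a bounded wildcard on the constructor parameter enforces the same compile-time bound as the class-level type parameter:

```java
// Stand-in classes for illustration only; RelNodeStub is a hypothetical
// placeholder for Calcite's RelNode, not the real class.
class RelNodeStub {}
class SortLimitStub extends RelNodeStub {}

// Generic variant, as in the PR: T never appears in any field or other
// member, so the type parameter only adds verbosity at declaration sites.
class GenericRule<T extends RelNodeStub> {
  GenericRule(Class<T> clazz) { /* e.g. operand(clazz, ...) */ }
}

// Non-generic variant suggested by the comment: a bounded wildcard gives
// the same compile-time guarantee that only RelNodeStub subclasses pass.
class PlainRule {
  PlainRule(Class<? extends RelNodeStub> clazz) { /* e.g. operand(clazz, ...) */ }
}

public class WildcardVsGeneric {
  public static void main(String[] args) {
    GenericRule<SortLimitStub> generic = new GenericRule<>(SortLimitStub.class);
    PlainRule plain = new PlainRule(SortLimitStub.class);
    // new PlainRule(String.class);  // would not compile: String is not a RelNodeStub
    System.out.println("both constructed OK");
  }
}
```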
##########
ql/src/test/queries/clientpositive/reducesink_dedup.q:
##########
@@ -1,3 +1,4 @@
+-- SORT_QUERY_RESULTS
--! qt:dataset:part
--! qt:dataset:src
select p_name
Review Comment:
The results of this query haven't changed since 2019, so I am wondering why they are changing now. Did the underlying plan change?
##########
ql/src/test/queries/clientpositive/distributeby.q:
##########
@@ -0,0 +1,24 @@
+create table t1 (a string, b int);
+
+insert into t1 values ('2014-03-14 10:10:12', 10);
+
+-- distribute by
+explain cbo
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
+explain
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
Review Comment:
It seems that for this query we verify the results, but not for the ones that follow. I would think there are already tests that cover the correctness/results of these clauses. Do we need new ones?
##########
ql/src/test/results/clientpositive/llap/mapreduce5.q.out:
##########
@@ -45,7 +45,7 @@ STAGE PLANS:
alias: src
                  Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
-                   expressions: key (type: string), UDFToInteger((key / 10)) (type: int), UDFToInteger((key % 10)) (type: int), value (type: string)
+                   expressions: key (type: string), UDFToInteger((UDFToDouble(key) / 10.0D)) (type: int), UDFToInteger((UDFToDouble(key) % 10.0D)) (type: int), value (type: string)
Review Comment:
I see various places where `UDFToDouble` is introduced. I guess the type casting rules are different with and without CBO; this is also something that may cause result changes after the PR gets in.
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java:
##########
@@ -995,12 +995,6 @@ private static String canHandleQbForCbo(QueryProperties queryProperties,
       HiveConf conf, boolean topLevelQB) {
List<String> reasons = new ArrayList<>();
// Not ok to run CBO, build error message.
- if (queryProperties.hasClusterBy()) {
- reasons.add("has cluster by");
- }
- if (queryProperties.hasDistributeBy()) {
- reasons.add("has distribute by");
- }
if (queryProperties.hasSortBy() && queryProperties.hasLimit()) {
reasons.add("has sort by with limit");
}
Review Comment:
What does `sort by` with `limit` mean? Why can't we handle it?
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java:
##########
@@ -5611,4 +5365,282 @@ private enum TableType {
JDBC
}
+ private class OrderByRelBuilder {
Review Comment:
Since a lot of code has been refactored and moved as part of this change, it may be a good opportunity to extract this refactored class completely out of `CalcitePlanner`, if that is possible (keeping it package private).
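A minimal sketch of the kind of extraction the comment proposes, using hypothetical names: a non-static inner class carries an implicit reference to its enclosing `CalcitePlanner`, so moving it to a package-private top-level class means passing that dependency in explicitly.

```java
// Placeholder for whatever CalcitePlanner state the builder actually uses;
// purely illustrative, not a real Hive type.
class RelBuilderContext {}

// Package-private top-level class instead of a private inner class: the
// implicit outer-instance reference becomes an explicit constructor argument.
class OrderByRelBuilder {
  private final RelBuilderContext ctx;

  OrderByRelBuilder(RelBuilderContext ctx) {
    this.ctx = ctx;
  }

  boolean hasContext() {
    return ctx != null;
  }
}

public class ExtractInnerClassSketch {
  public static void main(String[] args) {
    OrderByRelBuilder builder = new OrderByRelBuilder(new RelBuilderContext());
    System.out.println(builder.hasContext());
  }
}
```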
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSortTransposeRule.java:
##########
@@ -22,31 +22,36 @@
import org.apache.calcite.rel.RelNode;
import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveFilter;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortExchange;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
import com.google.common.collect.ImmutableList;
-public class HiveFilterSortTransposeRule extends RelOptRule {
+import static java.util.Collections.singletonList;
- public static final HiveFilterSortTransposeRule INSTANCE =
- new HiveFilterSortTransposeRule();
+public class HiveFilterSortTransposeRule&lt;T extends RelNode&gt; extends RelOptRule {
Review Comment:
Adding generics does not seem to improve the type safety of the class or its
usages. On the negative side it makes rule declarations more verbose.
##########
ql/src/test/queries/clientpositive/distributeby.q:
##########
@@ -0,0 +1,24 @@
+create table t1 (a string, b int);
+
+insert into t1 values ('2014-03-14 10:10:12', 10);
+
+-- distribute by
+explain cbo
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
Review Comment:
Why is there a WHERE clause and `date_add` functions? Do we intend to test something more than just the `DISTRIBUTE BY` clause? If they are not relevant, please remove them. If they are relevant, explain why we don't need the same tests for the rest of the queries.
##########
ql/src/test/queries/clientpositive/distributeby_cboret.q:
##########
@@ -0,0 +1,26 @@
+set hive.cbo.returnpath.hiveop=true;
+
+create table t1 (a string, b int);
+
+insert into t1 values ('2014-03-14 10:10:12', 10);
+
+-- distribute by
+explain cbo
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
Review Comment:
I don't see the CBO plans in the .q.out file. Is this normal?
##########
ql/src/test/queries/clientpositive/distributeby.q:
##########
@@ -0,0 +1,24 @@
+create table t1 (a string, b int);
+
+insert into t1 values ('2014-03-14 10:10:12', 10);
+
+-- distribute by
+explain cbo
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
+explain
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
+select * from t1 where a between date_add('2014-03-14', -1) and '2014-03-14' distribute by a;
+
+-- distribute by and sort by
+explain cbo
+select * from t1 distribute by a, b sort by a;
Review Comment:
I think Hive allows `SORT BY` without `DISTRIBUTE BY`; can we also add an `EXPLAIN CBO` test for this case?
##########
ql/src/test/results/clientpositive/llap/semijoin_hint.q.out:
##########
@@ -535,14 +535,14 @@ STAGE PLANS:
keys:
0 cstring (type: string)
1 value (type: string)
- outputColumnNames: str
-                  Statistics: Num rows: 3068 Data size: 266916 Basic stats: COMPLETE Column stats: COMPLETE
+                  outputColumnNames: cstring, str
+                  Statistics: Num rows: 3068 Data size: 336852 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: str (type: string)
null sort order: z
sort order: +
- Map-reduce partition columns: str (type: string)
-                    Statistics: Num rows: 3068 Data size: 266916 Basic stats: COMPLETE Column stats: COMPLETE
+                    Map-reduce partition columns: cstring (type: string)
Review Comment:
Are we picking another column? Why?
##########
ql/src/test/queries/clientpositive/distributeby.q:
##########
@@ -0,0 +1,24 @@
+create table t1 (a string, b int);
Review Comment:
Since these tests are mainly for CBO changes, consider renaming the file to `cbo_distributeby.q`, or maybe `cbo_distribute_sort_cluster_by.q` since we don't have only the distribute clause.
##########
ql/src/test/results/clientpositive/llap/input_columnarserde.q.out:
##########
@@ -54,7 +54,7 @@ STAGE PLANS:
Reduce Output Operator
null sort order:
sort order:
- Map-reduce partition columns: 1 (type: int)
+ Map-reduce partition columns: _col0 (type: array<int>)
Review Comment:
This change alters the semantics of the plan and could affect the result set. Did we fix a bug?
What are the semantics of `DISTRIBUTE BY 1`? Does it mean distribute by the column in position 1, or distribute by the numeric constant 1?
##########
ql/src/test/results/clientpositive/llap/implicit_cast_during_insert.q.out:
##########
@@ -40,60 +40,64 @@ STAGE PLANS:
            Map Operator Tree:
                TableScan
                  alias: src
-                 filterExpr: (key) IN (0, 1) (type: boolean)
+                 filterExpr: (UDFToDouble(key)) IN (0.0D, 1.0D) (type: boolean)
                  Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
-                   predicate: (key) IN (0, 1) (type: boolean)
-                   Statistics: Num rows: 3 Data size: 534 Basic stats: COMPLETE Column stats: COMPLETE
+                   predicate: (UDFToDouble(key)) IN (0.0D, 1.0D) (type: boolean)
+                   Statistics: Num rows: 250 Data size: 44500 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: value (type: string), key (type: string)
                      outputColumnNames: _col1, _col2
-                     Statistics: Num rows: 3 Data size: 534 Basic stats: COMPLETE Column stats: COMPLETE
+                     Statistics: Num rows: 250 Data size: 44500 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col2 (type: string)
                        null sort order: z
                        sort order: +
                        Map-reduce partition columns: _col2 (type: string)
-                       Statistics: Num rows: 3 Data size: 534 Basic stats: COMPLETE Column stats: COMPLETE
+                       Statistics: Num rows: 250 Data size: 44500 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: string)
-           Execution mode: llap
+           Execution mode: vectorized, llap
            LLAP IO: all inputs
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: UDFToInteger(KEY.reducesinkkey0) (type: int), VALUE._col0 (type: string), KEY.reducesinkkey0 (type: string)
                outputColumnNames: _col0, _col1, _col2
-               Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE Column stats: COMPLETE
-               File Output Operator
-                 compressed: false
-                 Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE Column stats: COMPLETE
-                 table:
-                     input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
-                     output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
-                     serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
-                     name: default.implicit_cast_during_insert
+               Statistics: Num rows: 250 Data size: 45500 Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string)
                  outputColumnNames: c1, c2, p1
-                 Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE Column stats: COMPLETE
+                 Statistics: Num rows: 250 Data size: 45500 Basic stats: COMPLETE Column stats: COMPLETE
                  Group By Operator
                    aggregations: min(c1), max(c1), count(1), count(c1), compute_bit_vector_hll(c1), max(length(c2)), avg(COALESCE(length(c2),0)), count(c2), compute_bit_vector_hll(c2)
                    keys: p1 (type: string)
                    mode: complete
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
-                   Statistics: Num rows: 2 Data size: 838 Basic stats: COMPLETE Column stats: COMPLETE
+                   Statistics: Num rows: 250 Data size: 104750 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: 'LONG' (type: string), UDFToLong(_col1) (type: bigint), UDFToLong(_col2) (type: bigint), (_col3 - _col4) (type: bigint), COALESCE(ndv_compute_bit_vector(_col5),0) (type: bigint), _col5 (type: binary), 'STRING' (type: string), UDFToLong(COALESCE(_col6,0)) (type: bigint), COALESCE(_col7,0) (type: double), (_col3 - _col8) (type: bigint), COALESCE(ndv_compute_bit_vector(_col9),0) (type: bigint), _col9 (type: binary), _col0 (type: string)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
-                     Statistics: Num rows: 2 Data size: 1234 Basic stats: COMPLETE Column stats: COMPLETE
+                     Statistics: Num rows: 250 Data size: 154250 Basic stats: COMPLETE Column stats: COMPLETE
                      File Output Operator
                        compressed: false
-                       Statistics: Num rows: 2 Data size: 1234 Basic stats: COMPLETE Column stats: COMPLETE
+                       Statistics: Num rows: 250 Data size: 154250 Basic stats: COMPLETE Column stats: COMPLETE
                        table:
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+               Select Operator
+                 expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string)
+                 outputColumnNames: _col0, _col1, _col2
+                 File Output Operator
+                   compressed: false
+                   Dp Sort State: PARTITION_SORTED
Review Comment:
This `Dp Sort State` attribute appeared after this change. Is this a good or a bad thing?
##########
ql/src/test/queries/clientpositive/distributeby_cboret.q:
##########
@@ -0,0 +1,26 @@
+set hive.cbo.returnpath.hiveop=true;
Review Comment:
Most tests that use this property use the `cbo_rp_` prefix, so for consistency let's do that here as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]