kasakrisz commented on code in PR #5505:
URL: https://github.com/apache/hive/pull/5505#discussion_r1923475434
##########
ql/src/test/results/clientpositive/llap/implicit_cast_during_insert.q.out:
##########
@@ -40,60 +40,64 @@ STAGE PLANS:
Map Operator Tree:
TableScan
alias: src
- filterExpr: (key) IN (0, 1) (type: boolean)
+ filterExpr: (UDFToDouble(key)) IN (0.0D, 1.0D) (type:
boolean)
Statistics: Num rows: 500 Data size: 89000 Basic stats:
COMPLETE Column stats: COMPLETE
Filter Operator
- predicate: (key) IN (0, 1) (type: boolean)
- Statistics: Num rows: 3 Data size: 534 Basic stats:
COMPLETE Column stats: COMPLETE
+ predicate: (UDFToDouble(key)) IN (0.0D, 1.0D) (type:
boolean)
+ Statistics: Num rows: 250 Data size: 44500 Basic stats:
COMPLETE Column stats: COMPLETE
Select Operator
expressions: value (type: string), key (type: string)
outputColumnNames: _col1, _col2
- Statistics: Num rows: 3 Data size: 534 Basic stats:
COMPLETE Column stats: COMPLETE
+ Statistics: Num rows: 250 Data size: 44500 Basic stats:
COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col2 (type: string)
null sort order: z
sort order: +
Map-reduce partition columns: _col2 (type: string)
- Statistics: Num rows: 3 Data size: 534 Basic stats:
COMPLETE Column stats: COMPLETE
+ Statistics: Num rows: 250 Data size: 44500 Basic
stats: COMPLETE Column stats: COMPLETE
value expressions: _col1 (type: string)
- Execution mode: llap
+ Execution mode: vectorized, llap
LLAP IO: all inputs
Reducer 2
Execution mode: vectorized, llap
Reduce Operator Tree:
Select Operator
expressions: UDFToInteger(KEY.reducesinkkey0) (type: int),
VALUE._col0 (type: string), KEY.reducesinkkey0 (type: string)
outputColumnNames: _col0, _col1, _col2
- Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE
Column stats: COMPLETE
- File Output Operator
- compressed: false
- Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE
Column stats: COMPLETE
- table:
- input format:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
- output format:
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
- serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
- name: default.implicit_cast_during_insert
+ Statistics: Num rows: 250 Data size: 45500 Basic stats:
COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col2
(type: string)
outputColumnNames: c1, c2, p1
- Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE
Column stats: COMPLETE
+ Statistics: Num rows: 250 Data size: 45500 Basic stats:
COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: min(c1), max(c1), count(1), count(c1),
compute_bit_vector_hll(c1), max(length(c2)), avg(COALESCE(length(c2),0)),
count(c2), compute_bit_vector_hll(c2)
keys: p1 (type: string)
mode: complete
outputColumnNames: _col0, _col1, _col2, _col3, _col4,
_col5, _col6, _col7, _col8, _col9
- Statistics: Num rows: 2 Data size: 838 Basic stats:
COMPLETE Column stats: COMPLETE
+ Statistics: Num rows: 250 Data size: 104750 Basic stats:
COMPLETE Column stats: COMPLETE
Select Operator
expressions: 'LONG' (type: string), UDFToLong(_col1)
(type: bigint), UDFToLong(_col2) (type: bigint), (_col3 - _col4) (type:
bigint), COALESCE(ndv_compute_bit_vector(_col5),0) (type: bigint), _col5 (type:
binary), 'STRING' (type: string), UDFToLong(COALESCE(_col6,0)) (type: bigint),
COALESCE(_col7,0) (type: double), (_col3 - _col8) (type: bigint),
COALESCE(ndv_compute_bit_vector(_col9),0) (type: bigint), _col9 (type: binary),
_col0 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4,
_col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
- Statistics: Num rows: 2 Data size: 1234 Basic stats:
COMPLETE Column stats: COMPLETE
+ Statistics: Num rows: 250 Data size: 154250 Basic stats:
COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
- Statistics: Num rows: 2 Data size: 1234 Basic stats:
COMPLETE Column stats: COMPLETE
+ Statistics: Num rows: 250 Data size: 154250 Basic
stats: COMPLETE Column stats: COMPLETE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+ Select Operator
+ expressions: _col0 (type: int), _col1 (type: string), _col2
(type: string)
+ outputColumnNames: _col0, _col1, _col2
+ File Output Operator
+ compressed: false
+ Dp Sort State: PARTITION_SORTED
Review Comment:
The root cause of this plan change is the change in the `Num rows` stats. They
changed because this patch enables CBO for the query in the test. The filter
predicate changed from
```
(key) IN (0, 1)
```
to
```
(UDFToDouble(key)) IN (0.0D, 1.0D)
```
so a type conversion is added.
The logic that populates stats for all the operators in the Hive operator
plan then fails to recognize that `UDFToDouble(key)` is still a column, only
cast, so the default estimate is returned (250 of 500 rows here, instead of 3).
https://github.com/apache/hive/blob/bc87c4d96374fcf7489809e1b5125c10254301c1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L474-L478
`SortedDynPartitionOptimizer` relies on these stats, so the plan changes with them.
A cast does not always change its input value, so there is room to improve
the stats estimation of expressions with the `IN` operator.
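A rough sketch of the idea, not Hive code: the expression classes, the `estimateInRows` helper and the NDV value below are all hypothetical and are not taken from `StatsRulesProcFactory`. It looks through casts to find the underlying column and uses that column's NDV for the `IN` estimate, falling back to half of the input rows only when no column is found. Reusing the column NDV for the cast expression is itself an assumption, since a cast can map several distinct inputs to the same value.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical, simplified model; these are NOT Hive's ExprNodeDesc classes.
final class InEstimateSketch {
    interface Expr {}

    static final class Column implements Expr {
        final String name;
        Column(String name) { this.name = name; }
    }

    static final class Cast implements Expr {
        final String targetType;
        final Expr input;
        Cast(String targetType, Expr input) { this.targetType = targetType; this.input = input; }
    }

    // Peels off any number of nested casts; returns the column underneath, or null.
    static Column unwrapCasts(Expr e) {
        while (e instanceof Cast) {
            e = ((Cast) e).input;
        }
        return (e instanceof Column) ? (Column) e : null;
    }

    // Estimates the output rows of "operand IN (v1, ..., vN)".
    static long estimateInRows(long numRows, Expr operand, int inListSize,
                               Map<String, Long> ndvByColumn) {
        Column col = unwrapCasts(operand);
        Long ndv = (col == null) ? null : ndvByColumn.get(col.name);
        if (ndv == null || ndv <= 0) {
            return numRows / 2; // default estimate when nothing is known
        }
        // Assumption: each IN value matches roughly numRows / ndv rows.
        double selectivity = Math.min(1.0, (double) inListSize / ndv);
        return Math.max(1L, Math.round(numRows * selectivity));
    }

    public static void main(String[] args) {
        // The NDV of 250 is made up for illustration, not the real NDV of src.key.
        Map<String, Long> ndv = Collections.singletonMap("key", 250L);
        Expr castKey = new Cast("double", new Column("key"));
        System.out.println(estimateInRows(500, castKey, 2, ndv));
        System.out.println(estimateInRows(500, new Cast("double", new Column("x")), 2, ndv));
    }
}
```

With these made-up inputs the cast column gets a small estimate, while an operand with no known column stats still gets the half-of-input default.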
WDYT?