Re: [PR] HIVE-28572: Support Distribute by and Cluster by clauses in CBO [hive]

via GitHub Tue, 21 Jan 2025 02:35:13 -0800


kasakrisz commented on code in PR #5505:
URL: https://github.com/apache/hive/pull/5505#discussion_r1923475434



##########
ql/src/test/results/clientpositive/llap/implicit_cast_during_insert.q.out:
##########
@@ -40,60 +40,64 @@ STAGE PLANS:
             Map Operator Tree:
                 TableScan
                   alias: src
-                  filterExpr: (key) IN (0, 1) (type: boolean)
+                  filterExpr: (UDFToDouble(key)) IN (0.0D, 1.0D) (type: 
boolean)
                   Statistics: Num rows: 500 Data size: 89000 Basic stats: 
COMPLETE Column stats: COMPLETE
                   Filter Operator
-                    predicate: (key) IN (0, 1) (type: boolean)
-                    Statistics: Num rows: 3 Data size: 534 Basic stats: 
COMPLETE Column stats: COMPLETE
+                    predicate: (UDFToDouble(key)) IN (0.0D, 1.0D) (type: 
boolean)
+                    Statistics: Num rows: 250 Data size: 44500 Basic stats: 
COMPLETE Column stats: COMPLETE
                     Select Operator
                       expressions: value (type: string), key (type: string)
                       outputColumnNames: _col1, _col2
-                      Statistics: Num rows: 3 Data size: 534 Basic stats: 
COMPLETE Column stats: COMPLETE
+                      Statistics: Num rows: 250 Data size: 44500 Basic stats: 
COMPLETE Column stats: COMPLETE
                       Reduce Output Operator
                         key expressions: _col2 (type: string)
                         null sort order: z
                         sort order: +
                         Map-reduce partition columns: _col2 (type: string)
-                        Statistics: Num rows: 3 Data size: 534 Basic stats: 
COMPLETE Column stats: COMPLETE
+                        Statistics: Num rows: 250 Data size: 44500 Basic 
stats: COMPLETE Column stats: COMPLETE
                         value expressions: _col1 (type: string)
-            Execution mode: llap
+            Execution mode: vectorized, llap
             LLAP IO: all inputs
         Reducer 2 
             Execution mode: vectorized, llap
             Reduce Operator Tree:
               Select Operator
                 expressions: UDFToInteger(KEY.reducesinkkey0) (type: int), 
VALUE._col0 (type: string), KEY.reducesinkkey0 (type: string)
                 outputColumnNames: _col0, _col1, _col2
-                Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE 
Column stats: COMPLETE
-                File Output Operator
-                  compressed: false
-                  Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE 
Column stats: COMPLETE
-                  table:
-                      input format: 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
-                      output format: 
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
-                      serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
-                      name: default.implicit_cast_during_insert
+                Statistics: Num rows: 250 Data size: 45500 Basic stats: 
COMPLETE Column stats: COMPLETE
                 Select Operator
                   expressions: _col0 (type: int), _col1 (type: string), _col2 
(type: string)
                   outputColumnNames: c1, c2, p1
-                  Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE 
Column stats: COMPLETE
+                  Statistics: Num rows: 250 Data size: 45500 Basic stats: 
COMPLETE Column stats: COMPLETE
                   Group By Operator
                     aggregations: min(c1), max(c1), count(1), count(c1), 
compute_bit_vector_hll(c1), max(length(c2)), avg(COALESCE(length(c2),0)), 
count(c2), compute_bit_vector_hll(c2)
                     keys: p1 (type: string)
                     mode: complete
                     outputColumnNames: _col0, _col1, _col2, _col3, _col4, 
_col5, _col6, _col7, _col8, _col9
-                    Statistics: Num rows: 2 Data size: 838 Basic stats: 
COMPLETE Column stats: COMPLETE
+                    Statistics: Num rows: 250 Data size: 104750 Basic stats: 
COMPLETE Column stats: COMPLETE
                     Select Operator
                       expressions: 'LONG' (type: string), UDFToLong(_col1) 
(type: bigint), UDFToLong(_col2) (type: bigint), (_col3 - _col4) (type: 
bigint), COALESCE(ndv_compute_bit_vector(_col5),0) (type: bigint), _col5 (type: 
binary), 'STRING' (type: string), UDFToLong(COALESCE(_col6,0)) (type: bigint), 
COALESCE(_col7,0) (type: double), (_col3 - _col8) (type: bigint), 
COALESCE(ndv_compute_bit_vector(_col9),0) (type: bigint), _col9 (type: binary), 
_col0 (type: string)
                       outputColumnNames: _col0, _col1, _col2, _col3, _col4, 
_col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
-                      Statistics: Num rows: 2 Data size: 1234 Basic stats: 
COMPLETE Column stats: COMPLETE
+                      Statistics: Num rows: 250 Data size: 154250 Basic stats: 
COMPLETE Column stats: COMPLETE
                       File Output Operator
                         compressed: false
-                        Statistics: Num rows: 2 Data size: 1234 Basic stats: 
COMPLETE Column stats: COMPLETE
+                        Statistics: Num rows: 250 Data size: 154250 Basic 
stats: COMPLETE Column stats: COMPLETE
                         table:
                             input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                             output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                             serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+                Select Operator
+                  expressions: _col0 (type: int), _col1 (type: string), _col2 
(type: string)
+                  outputColumnNames: _col0, _col1, _col2
+                  File Output Operator
+                    compressed: false
+                    Dp Sort State: PARTITION_SORTED

Review Comment:
   The root cause of this change is the change in the `Num rows` stats. They 
changed because this patch enables CBO for the query in this test. With CBO, 
the filter predicate changes from
   ```
   (key) IN (0, 1)
   ```
   to
   ```
   (UDFToDouble(key)) IN (0.0D, 1.0D)
   ```
   so a type conversion is added.
   
   The logic that populates stats for all the operators in the Hive operator 
plan then fails to recognize that `UDFToDouble(key)` is still a column 
reference, only wrapped in a cast, so the default estimate is returned.
    
   
https://github.com/apache/hive/blob/bc87c4d96374fcf7489809e1b5125c10254301c1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L474-L478
   
   `SortedDynPartitionOptimizer` relies on these stats, so the higher row 
estimate changes its decision.
   
   A cast does not always change its input value, so there is room to improve 
the stats estimation for expressions with the `IN` operator.
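
To make the idea concrete, here is a rough sketch of a cast-aware `IN` estimate. It is a toy model, not Hive's code: the `Expr` class, the `isCast` check on the `UDFTo` prefix, and the `estimateInRows` formula (IN-list size divided by NDV, falling back to half the rows) are all hypothetical, and the NDV of 309 is an assumed value picked for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CastAwareInEstimate {

    /** Hypothetical expression node: a column (no children) or a function call. */
    public static final class Expr {
        final String name;         // column name or function name
        final List<Expr> children; // empty for a plain column

        public Expr(String name, Expr... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        boolean isColumn() {
            return children.isEmpty();
        }
    }

    /** Assumption: single-argument functions named UDFTo* are treated as casts. */
    static boolean isCast(Expr e) {
        return e.children.size() == 1 && e.name.startsWith("UDFTo");
    }

    /** Strips any chain of casts and returns the innermost expression. */
    static Expr unwrapCasts(Expr e) {
        while (isCast(e)) {
            e = e.children.get(0);
        }
        return e;
    }

    /**
     * Toy estimate of the rows passing "operand IN (v1, ..., vn)".
     * If the operand is a column, use inListSize / ndv as the selectivity;
     * otherwise fall back to half of the input rows.
     */
    public static long estimateInRows(Expr operand, int inListSize, long numRows,
                                      long ndv, boolean lookThroughCasts) {
        Expr target = lookThroughCasts ? unwrapCasts(operand) : operand;
        if (!target.isColumn() || ndv <= 0) {
            return numRows / 2; // default estimate for unrecognized operands
        }
        double selectivity = Math.min(1.0, (double) inListSize / ndv);
        return Math.round(numRows * selectivity);
    }

    public static void main(String[] args) {
        Expr key = new Expr("key");
        Expr casted = new Expr("UDFToDouble", key);
        // Plain column: NDV-based estimate.
        System.out.println(estimateInRows(key, 2, 500, 309, false));
        // Cast not recognized: default estimate.
        System.out.println(estimateInRows(casted, 2, 500, 309, false));
        // Cast unwrapped: NDV-based estimate again.
        System.out.println(estimateInRows(casted, 2, 500, 309, true));
    }
}
```

Looking through the cast is only a heuristic: a string-to-double cast can map several distinct strings (for example `'0'` and `'0.0'`) to the same value, so the column's NDV is an upper bound on the NDV of the cast expression rather than an exact match.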
   
   WDYT?


