Re: [PR] HIVE-28572: Support Distribute by and Cluster by clauses in CBO [hive]

via GitHub Wed, 22 Jan 2025 03:53:27 -0800


zabetak commented on code in PR #5505:
URL: https://github.com/apache/hive/pull/5505#discussion_r1925192558



##########
ql/src/test/results/clientpositive/llap/implicit_cast_during_insert.q.out:
##########
@@ -40,60 +40,64 @@ STAGE PLANS:
             Map Operator Tree:
                 TableScan
                   alias: src
-                  filterExpr: (key) IN (0, 1) (type: boolean)
+                  filterExpr: (UDFToDouble(key)) IN (0.0D, 1.0D) (type: boolean)
                   Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
                   Filter Operator
-                    predicate: (key) IN (0, 1) (type: boolean)
-                    Statistics: Num rows: 3 Data size: 534 Basic stats: COMPLETE Column stats: COMPLETE
+                    predicate: (UDFToDouble(key)) IN (0.0D, 1.0D) (type: boolean)
+                    Statistics: Num rows: 250 Data size: 44500 Basic stats: COMPLETE Column stats: COMPLETE
                     Select Operator
                       expressions: value (type: string), key (type: string)
                       outputColumnNames: _col1, _col2
-                      Statistics: Num rows: 3 Data size: 534 Basic stats: COMPLETE Column stats: COMPLETE
+                      Statistics: Num rows: 250 Data size: 44500 Basic stats: COMPLETE Column stats: COMPLETE
                       Reduce Output Operator
                         key expressions: _col2 (type: string)
                         null sort order: z
                         sort order: +
                         Map-reduce partition columns: _col2 (type: string)
-                        Statistics: Num rows: 3 Data size: 534 Basic stats: COMPLETE Column stats: COMPLETE
+                        Statistics: Num rows: 250 Data size: 44500 Basic stats: COMPLETE Column stats: COMPLETE
                         value expressions: _col1 (type: string)
-            Execution mode: llap
+            Execution mode: vectorized, llap
             LLAP IO: all inputs
         Reducer 2 
             Execution mode: vectorized, llap
             Reduce Operator Tree:
               Select Operator
                 expressions: UDFToInteger(KEY.reducesinkkey0) (type: int), VALUE._col0 (type: string), KEY.reducesinkkey0 (type: string)
                 outputColumnNames: _col0, _col1, _col2
-                Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE Column stats: COMPLETE
-                File Output Operator
-                  compressed: false
-                  Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE Column stats: COMPLETE
-                  table:
-                      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
-                      output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
-                      serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
-                      name: default.implicit_cast_during_insert
+                Statistics: Num rows: 250 Data size: 45500 Basic stats: COMPLETE Column stats: COMPLETE
                 Select Operator
                   expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string)
                   outputColumnNames: c1, c2, p1
-                  Statistics: Num rows: 3 Data size: 546 Basic stats: COMPLETE Column stats: COMPLETE
+                  Statistics: Num rows: 250 Data size: 45500 Basic stats: COMPLETE Column stats: COMPLETE
                   Group By Operator
                     aggregations: min(c1), max(c1), count(1), count(c1), compute_bit_vector_hll(c1), max(length(c2)), avg(COALESCE(length(c2),0)), count(c2), compute_bit_vector_hll(c2)
                     keys: p1 (type: string)
                     mode: complete
                     outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
-                    Statistics: Num rows: 2 Data size: 838 Basic stats: COMPLETE Column stats: COMPLETE
+                    Statistics: Num rows: 250 Data size: 104750 Basic stats: COMPLETE Column stats: COMPLETE
                     Select Operator
                       expressions: 'LONG' (type: string), UDFToLong(_col1) (type: bigint), UDFToLong(_col2) (type: bigint), (_col3 - _col4) (type: bigint), COALESCE(ndv_compute_bit_vector(_col5),0) (type: bigint), _col5 (type: binary), 'STRING' (type: string), UDFToLong(COALESCE(_col6,0)) (type: bigint), COALESCE(_col7,0) (type: double), (_col3 - _col8) (type: bigint), COALESCE(ndv_compute_bit_vector(_col9),0) (type: bigint), _col9 (type: binary), _col0 (type: string)
                       outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
-                      Statistics: Num rows: 2 Data size: 1234 Basic stats: COMPLETE Column stats: COMPLETE
+                      Statistics: Num rows: 250 Data size: 154250 Basic stats: COMPLETE Column stats: COMPLETE
                       File Output Operator
                         compressed: false
-                        Statistics: Num rows: 2 Data size: 1234 Basic stats: COMPLETE Column stats: COMPLETE
+                        Statistics: Num rows: 250 Data size: 154250 Basic stats: COMPLETE Column stats: COMPLETE
                         table:
                             input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                             output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+                Select Operator
+                  expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string)
+                  outputColumnNames: _col0, _col1, _col2
+                  File Output Operator
+                    compressed: false
+                    Dp Sort State: PARTITION_SORTED

Review Comment:
   Thanks for the explanation. The stats improvement is orthogonal to this PR 
so we don't need to handle it here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

