Re: [PR] HIVE-28510: Iceberg: FanoutPositionOnlyDeleteWriter support [hive]

via GitHub Tue, 17 Sep 2024 03:06:20 -0700


kasakrisz commented on code in PR #5418:
URL: https://github.com/apache/hive/pull/5418#discussion_r1761300571



##########
ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java:
##########
@@ -498,6 +493,11 @@ default FieldSchema getRowId() {
    * @return the list of columns that should be used as sort columns in the 
rewritten ACID query
    */
   default List<FieldSchema> 
acidSortColumns(org.apache.hadoop.hive.ql.metadata.Table table, Operation 
operation) {
+    return acidSortColumns(table, operation, false);
+  }
+
+  default List<FieldSchema> 
acidSortColumns(org.apache.hadoop.hive.ql.metadata.Table table, Operation 
operation, 
+      boolean emptyOrdering) {

Review Comment:
   The API change can be avoided: I found that `emptyOrdering` is always `true` 
so we want to let storage handler to decided which are the sort keys.
   There are 2 cases when it is `false`, both are in MergeRewriter
   * cardinality violation check
   
https://github.com/apache/hive/blob/10efefbfbb3722e9fdc8b438838bd7c964bf5224/ql/src/java/org/apache/hadoop/hive/ql/parse/rewrite/MergeRewriter.java#L123C42-L123C49
   * when matched update clause
   
https://github.com/apache/hive/blob/10efefbfbb3722e9fdc8b438838bd7c964bf5224/ql/src/java/org/apache/hadoop/hive/ql/parse/rewrite/MergeRewriter.java#L213
   The second one is overriden in SplitMergeRewriter so it is not used in case 
of Iceberg.
   
   Instead of calling `getSortKeys` and 
`HiveStorageHandler.acidSortColumns(table, operation, false)` we can add a new 
method `MultiInsertSqlGenerator.getAcidSelectColumns` which calls 
`HiveStorageHandler.acidSelectColumns(targetTable, operation)`. In this case 
operation is `MERGE`



##########
iceberg/iceberg-handler/src/test/results/positive/iceberg_truncate_partition_with_evolution.q.out:
##########
@@ -78,9 +78,6 @@ STAGE DEPENDENCIES:
 STAGE PLANS:
   Stage: Stage-2
     Tez
-#### A masked pattern was here ####
-      Edges:
-        Reducer 2 <- Map 1 (SIMPLE_EDGE)

Review Comment:
   ```
   truncate table test_ice_int partition (a = 22)
   ```
   is rewritten to 
   ```
   delete from `default`.`test_ice_int` where `a` = 22
   ```
   then DeleteSemanticAnalyzer rewrites it to
   ```
   insert into table `default`.`test_ice_int` 
   select `PARTITION__SPEC__ID` AS `__d__PARTITION__SPEC__ID`,`PARTITION__HASH` 
AS `__d__PARTITION__HASH`,`FILE__PATH` AS `__d__FILE__PATH`,`ROW__POSITION` AS 
`__d__ROW__POSITION`,`PARTITION__PROJECTION` AS 
`__d__PARTITION__PROJECTION`,`a` AS `__d__a`,`b` AS `__d__b` 
   from `default`.`test_ice_int`
   ```
   Sort by clause is not added because `write.fanout.enabled` is default to 
`true`, hence no reducer is necessary.



##########
iceberg/iceberg-handler/src/test/results/positive/merge_iceberg_orc.q.out:
##########
@@ -186,67 +188,37 @@ STAGE PLANS:
                     predicate: (_col10 = _col1) (type: boolean)
                     Statistics: Num rows: 5 Data size: 1929 Basic stats: 
COMPLETE Column stats: COMPLETE
                     Select Operator
-                      expressions: _col2 (type: string), _col3 (type: string), 
_col5 (type: bigint), _col6 (type: bigint), _col7 (type: int)
-                      outputColumnNames: _col2, _col3, _col5, _col6, _col7
+                      expressions: _col2 (type: string), _col5 (type: bigint), 
_col6 (type: bigint), _col7 (type: int)
+                      outputColumnNames: _col2, _col5, _col6, _col7
                       Statistics: Num rows: 5 Data size: 1929 Basic stats: 
COMPLETE Column stats: COMPLETE
                       Group By Operator
                         aggregations: count()
-                        keys: _col7 (type: int), _col6 (type: bigint), _col2 
(type: string), _col5 (type: bigint), _col3 (type: string)
+                        keys: _col7 (type: int), _col6 (type: bigint), _col2 
(type: string), _col5 (type: bigint)
                         minReductionHashAggr: 0.4
                         mode: hash
-                        outputColumnNames: _col0, _col1, _col2, _col3, _col4, 
_col5
-                        Statistics: Num rows: 4 Data size: 1196 Basic stats: 
COMPLETE Column stats: COMPLETE
+                        outputColumnNames: _col0, _col1, _col2, _col3, _col4
+                        Statistics: Num rows: 4 Data size: 644 Basic stats: 
COMPLETE Column stats: COMPLETE
                         Reduce Output Operator
-                          key expressions: _col0 (type: int), _col1 (type: 
bigint), _col2 (type: string), _col3 (type: bigint), _col4 (type: string)
-                          null sort order: zzzzz
-                          sort order: +++++

Review Comment:
   Originally `ACID_VIRTUAL_COLS_AS_FIELD_SCHEMA` was used when sort columns 
were pulled and it contains `VirtualColumn.PARTITION_PROJECTION` too.
   Now 
   ```
     private static final List<FieldSchema> POSITION_DELETE_ORDERING =
           orderBy(PARTITION_SPEC_ID, PARTITION_HASH, FILE_PATH, ROW_POSITION);
   ```
   is used. This doesn't include VirtualColumn.PARTITION_PROJECTION. Is this 
change intentional?



##########
iceberg/iceberg-handler/src/test/results/positive/merge_iceberg_orc.q.out:
##########
@@ -68,10 +68,8 @@ STAGE PLANS:
     Tez
 #### A masked pattern was here ####
       Edges:
-        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 6 (SIMPLE_EDGE)
+        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
-        Reducer 4 <- Reducer 2 (SIMPLE_EDGE)
-        Reducer 5 <- Reducer 2 (SIMPLE_EDGE)

Review Comment:
   Similar to the case when a truncate table is transformed into a delete, no 
sort reducers are required. I think this change make sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-28510: Iceberg: FanoutPositionOnlyDeleteWriter support [hive]

Reply via email to