ngsg commented on code in PR #4043:
URL: https://github.com/apache/hive/pull/4043#discussion_r1395752162


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/DynamicPartitionPruningOptimization.java:
##########
@@ -678,38 +678,34 @@ private boolean generateSemiJoinOperatorPlan(DynamicListContext ctx, ParseContex
     ArrayList<ColumnInfo> groupbyColInfos = new ArrayList<ColumnInfo>();
     groupbyColInfos.add(new ColumnInfo(gbOutputNames.get(0), key.getTypeInfo(), "", false));
     groupbyColInfos.add(new ColumnInfo(gbOutputNames.get(1), key.getTypeInfo(), "", false));
-    groupbyColInfos.add(new ColumnInfo(gbOutputNames.get(2), key.getTypeInfo(), "", false));
+    groupbyColInfos.add(new ColumnInfo(gbOutputNames.get(2), TypeInfoFactory.binaryTypeInfo, "", false));

     GroupByOperator groupByOp = (GroupByOperator)OperatorFactory.getAndMakeChild(
             groupBy, new RowSchema(groupbyColInfos), selectOp);

     groupByOp.setColumnExprMap(new HashMap<String, ExprNodeDesc>());

     // Get the column names of the aggregations for reduce sink
-    int colPos = 0;
     ArrayList<ExprNodeDesc> rsValueCols = new ArrayList<ExprNodeDesc>();
     Map<String, ExprNodeDesc> columnExprMap = new HashMap<String, ExprNodeDesc>();
-    for (int i = 0; i < aggs.size() - 1; i++) {
-      ExprNodeColumnDesc colExpr = new ExprNodeColumnDesc(key.getTypeInfo(),
-              gbOutputNames.get(colPos), "", false);
+    ArrayList<ColumnInfo> rsColInfos = new ArrayList<>();
+    for (int colPos = 0; colPos < aggs.size(); colPos++) {
+      TypeInfo typInfo = groupbyColInfos.get(colPos).getType();
+      ExprNodeColumnDesc colExpr = new ExprNodeColumnDesc(typInfo, gbOutputNames.get(colPos), "", false);
       rsValueCols.add(colExpr);
-      columnExprMap.put(gbOutputNames.get(colPos), colExpr);
-      colPos++;
-    }
+      columnExprMap.put(Utilities.ReduceField.VALUE + "." + gbOutputNames.get(colPos), colExpr);

-    // Bloom Filter uses binary
-    ExprNodeColumnDesc colExpr = new ExprNodeColumnDesc(TypeInfoFactory.binaryTypeInfo,
-        gbOutputNames.get(colPos), "", false);
-    rsValueCols.add(colExpr);
-    columnExprMap.put(gbOutputNames.get(colPos), colExpr);
-    colPos++;
+      ColumnInfo colInfo =

Review Comment:
   @deniskuzZ, I have checked your comment and my work, and I summarize my conclusions as follows:
   
   1. about `ReduceField.VALUE`
   
   I think we should prefix the RS operator's column names in `colExprMap` and `schema` because RS's child operators always access their input columns (the output of RS) as `KEY.col` and `VALUE.col`.
   
   An RS operator's output rows are transported to the next operator via shuffle, not by directly calling `Operator.forward()`. `ReduceRecordSource` reads the shuffled KV pairs and calls the child operator's `Operator.process()` in place of `Operator.forward()`. If vectorization is disabled, it passes a `List<Object>` of length 2 as a row, and the corresponding ObjectInspector consists of `ReduceField.KEY` and `ReduceField.VALUE`. [1] If vectorization is enabled, it passes a single struct object as a row, and the corresponding ObjectInspector consists of `ReduceField.KEY + "." + fieldName` and `ReduceField.VALUE + "." + fieldName`. [2] In both cases, the column names seen by RS's child operator start with either `ReduceField.KEY` or `ReduceField.VALUE`.
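   The two naming schemes above can be sketched with plain strings (a hypothetical mini-model: `"KEY"`/`"VALUE"` stand in for Hive's `Utilities.ReduceField` constants, and no actual Hive classes are used):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical mini-model of the reduce-side column naming described above.
// "KEY" and "VALUE" stand in for Utilities.ReduceField.KEY/VALUE.
public class ReduceFieldNames {

    // Non-vectorized path: the row is a List<Object> of length 2, so the
    // top-level ObjectInspector exposes exactly two fields, KEY and VALUE.
    static List<String> nonVectorizedFields() {
        return Arrays.asList("KEY", "VALUE");
    }

    // Vectorized path: the row is a single flat struct, and every source
    // column name is prefixed with the reduce field it belongs to.
    static List<String> vectorizedFields(List<String> keyCols, List<String> valueCols) {
        List<String> fields = new ArrayList<>();
        for (String c : keyCols) {
            fields.add("KEY." + c);
        }
        for (String c : valueCols) {
            fields.add("VALUE." + c);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(nonVectorizedFields());
        System.out.println(vectorizedFields(Arrays.asList("_col0"), Arrays.asList("_col1", "_col2")));
    }
}
```

   Either way, a child operator resolves its input columns by a `KEY`- or `VALUE`-prefixed name, which is why the patch builds the map keys as `Utilities.ReduceField.VALUE + "." + gbOutputNames.get(colPos)`.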
   
   `colExprMap` maps an output column name to its expression [3], so the keys of an RS's `colExprMap` should be prefixed with `ReduceField.KEY` or `ReduceField.VALUE`. I could not find any documentation about `schema`, but it appears that `schema` also represents output column names. [4] So I think both the keys of an RS's `colExprMap` and its `schema` should be prefixed with `ReduceField.KEY` or `ReduceField.VALUE`.
   
   2. about your investigation
   
   DPPOptimization creates 4 operators, GBY->RS->GBY->RS, and `sharedwork_semi_2.q` tests PEF (`ParallelEdgeFixer`) by inverting one of the final RS operators that DPPOptimization created. PEF refers to the RS's `colExprMap` when creating the SEL operator that performs the inversion. That's why the test fails with `java.lang.RuntimeException: cannot find field _col0 from [0:key, 1:value]` if we do not prefix the keys of the final RS's `colExprMap` with `ReduceField.VALUE`. Unlike the final RS operators, the intermediate RS operators are not inverted during the test, so prefixing the intermediate RS operators' column names does not affect the result of the test.
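   The lookup failure can be illustrated with a string-only sketch (hypothetical placeholder values instead of real `ExprNodeDesc` objects; `"VALUE"` again stands in for `Utilities.ReduceField.VALUE`):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// String-only sketch of the colExprMap keys before and after the change.
// Map values here are placeholder strings, not real ExprNodeDesc objects.
public class ColExprMapSketch {

    static Map<String, String> buildColExprMap(List<String> gbOutputNames, boolean prefixed) {
        Map<String, String> columnExprMap = new LinkedHashMap<>();
        for (String name : gbOutputNames) {
            // Old behavior (prefixed == false): bare names like "_col0",
            // which a PEF-created SEL cannot resolve against the reduce-side
            // row [0:key, 1:value].
            // New behavior (prefixed == true): "VALUE._col0", which the
            // child operator can resolve.
            String key = prefixed ? "VALUE." + name : name;
            columnExprMap.put(key, "expr(" + name + ")");
        }
        return columnExprMap;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("_col0", "_col1", "_col2");
        System.out.println(buildColExprMap(names, false).keySet());
        System.out.println(buildColExprMap(names, true).keySet());
    }
}
```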
   
   3. about `ParallelEdgeFixer.colMappingInverseKeys()`
   
   According to the comment on `Operator.getColumnExprMap()`, it returns only key columns for RS operators. [3] I'm not sure whether that is still valid, but I want to keep the added code as a form of defensive programming.
   
   [1] https://github.com/apache/hive/blob/8a4f5ce7275842ff4f1cc917c7a2a48dde71bf4c/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ReduceRecordSource.java#L229-L231
   [2] https://github.com/apache/hive/blob/8a4f5ce7275842ff4f1cc917c7a2a48dde71bf4c/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L4333-L4341
   [3] https://github.com/apache/hive/blob/8a4f5ce7275842ff4f1cc917c7a2a48dde71bf4c/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java#L991-L997
   [4] https://github.com/apache/hive/blob/8a4f5ce7275842ff4f1cc917c7a2a48dde71bf4c/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java#L443-L447
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

