zabetak commented on code in PR #6165:
URL: https://github.com/apache/hive/pull/6165#discussion_r2549975690


##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/MapJoinTestData.java:
##########
@@ -297,14 +297,25 @@ public static void generateVariationData(MapJoinTestData testData,
 
   private static RowTestObjects generateRandomSmallTableValueRow(MapJoinTestDescription testDesc, Random random) {
 
+    final ValueOption valueOption = testDesc.getSmallTableGenerationParameters().getValueOption();
     final int columnCount = testDesc.smallTableValueTypeInfos.length;
     PrimitiveTypeInfo[] primitiveTypeInfos = new PrimitiveTypeInfo[columnCount];
     for (int i = 0; i < columnCount; i++) {
       primitiveTypeInfos[i] = (PrimitiveTypeInfo) testDesc.smallTableValueTypeInfos[i];
     }
-    Object[] smallTableValueRow =
-        VectorRandomRowSource.randomWritablePrimitiveRow(
-            columnCount, random, primitiveTypeInfos);
+    Object[] smallTableValueRow;
+
+    // Generate empty value row if specified.
+    if (valueOption == ValueOption.EMPTY_VALUE) {
+      smallTableValueRow = new Object[columnCount];
+      for (int c = 0; c < columnCount; c++) {
+        smallTableValueRow[c] =
+            VectorizedBatchUtil.getPrimitiveWritable(primitiveTypeInfos[c].getPrimitiveCategory());

Review Comment:
   Which real use-case is this code trying to simulate? Is this equivalent to having `null` values in the data, or something else? Basically, I am trying to understand under what circumstances we have these values in the table.
   
   From a quick look, it seems that we are using special values (e.g., `new Text(ArrayUtils.EMPTY_BYTE_ARRAY)`) but not really nulls. If that's the case, then I don't see why we need to handle this separately from `VectorRandomRowSource.randomWritablePrimitiveRow`. Wouldn't it make more sense to tune the random generator to occasionally generate these "special" values, if they can really appear in practice?
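   One way the reviewer's suggestion could look (a minimal sketch, not Hive code — the class and method names below are hypothetical): bias the random generator so that it occasionally emits the "special" empty value alongside regular random values, instead of handling emptiness in a separate code path.

   ```java
   import java.util.Random;

   // Hypothetical sketch: tune a random value generator so it occasionally
   // emits the "special" empty value (analogous to new Text(EMPTY_BYTE_ARRAY))
   // instead of generating empty rows through a separate code path.
   public class BiasedValueGenerator {
       // Small probability of producing the special value; illustrative only.
       private static final double EMPTY_VALUE_PROBABILITY = 0.05;

       static String randomValue(Random random) {
           if (random.nextDouble() < EMPTY_VALUE_PROBABILITY) {
               return "";  // the "special" empty value
           }
           return "v" + random.nextInt(1000);  // a regular random value
       }

       public static void main(String[] args) {
           Random random = new Random(42);  // fixed seed for reproducibility
           int empty = 0;
           for (int i = 0; i < 10_000; i++) {
               if (randomValue(random).isEmpty()) {
                   empty++;
               }
           }
           // With a 5% bias, some empty values should always appear.
           System.out.println("empty values: " + empty);
       }
   }
   ```

   With this approach, the existing `randomWritablePrimitiveRow`-style generation would cover the empty case "for free", and no dedicated `EMPTY_VALUE` branch would be needed in the test data generator.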



##########
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinGenerateResultOperator.java:
##########
@@ -226,6 +223,29 @@ protected int generateHashMapResultSingleValue(VectorizedRowBatch batch,
     return numSel;
   }
 
+  private void updateBatchWithSmallTableValue(
+      VectorizedRowBatch batch, VectorMapJoinHashMapResult hashMapResult, int batchIndex, ByteSegmentRef byteSegmentRef)
+      throws HiveException {
+
+    // Check if the small table value is empty.
+    boolean isSmallTableValueEmpty = byteSegmentRef.getLength() == 0;

Review Comment:
   The fact that we need to check the actual data in order to decide how to 
evaluate the join (or rather the creation of the resulting row) is somewhat 
suspicious and a bit brittle. Ideally, the compiler should be able to determine 
exactly how the operator should behave via the query plan. Can we exploit (or 
add) information in the query plan in order to drive the copy decision below?
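   The reviewer's alternative could be sketched as follows (a hypothetical illustration, not Hive API — the class and field names are made up): the compiler/optimizer sets an explicit flag in the operator descriptor at plan time, and the operator consults that flag instead of inspecting `byteSegmentRef.getLength()` per row at runtime.

   ```java
   // Hypothetical sketch of a plan-driven copy decision: the query plan
   // carries an explicit flag, so the operator never has to infer behavior
   // from the runtime data (e.g., byteSegmentRef.getLength() == 0).
   public class PlanDrivenCopyDecision {

       // Set once by the compiler/optimizer when building the operator
       // descriptor, e.g. when the small table projects no value columns.
       private final boolean smallTableValuesEmpty;

       PlanDrivenCopyDecision(boolean smallTableValuesEmpty) {
           this.smallTableValuesEmpty = smallTableValuesEmpty;
       }

       // The operator branches on the plan-level flag, not on per-row data.
       String copyStrategy() {
           return smallTableValuesEmpty
               ? "copy-from-big-table-keys"
               : "copy-from-hash-map-value";
       }

       public static void main(String[] args) {
           System.out.println(new PlanDrivenCopyDecision(true).copyStrategy());
           System.out.println(new PlanDrivenCopyDecision(false).copyStrategy());
       }
   }
   ```

   The design benefit is that the behavior becomes visible in (and testable from) the query plan itself, rather than depending on a property of the serialized hash-map payload.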



##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/TestMapJoinOperator.java:
##########
@@ -1512,8 +1605,17 @@ private RowTestObjectsMultiSet createExpectedTestRowMultiSet(MapJoinTestDescript
                 Object[] valueRow = valueList.get(v).getRow();
                 final int smallTableRetainValueColumnNumsLength =
                     testDesc.smallTableRetainValueColumnNums.length;
+
+                // When EMPTY_VALUE is specified, the small table value columns are not
+                // actually stored in the small table HashMap. So we have to simulate that here.
+                // The expected output should have values from the big table key columns.
+                final boolean isEmptyValue =
+                    testDesc.smallTableGenerationParameters.getValueOption() == ValueOption.EMPTY_VALUE &&
+                    testDesc.smallTableRetainValueColumnNums.length > 0 &&
+                    testDesc.smallTableRetainValueColumnNums.length == testDesc.bigTableKeyColumnNums.length;

Review Comment:
   This simulation is problematic because it makes the solution and the test code somewhat identical. We're implementing the copy logic in two places (prod & test), so the tests will trivially pass as they are right now and will immediately fail if the implementation changes in the future.



##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/MapJoinTestDescription.java:
##########
@@ -44,7 +45,8 @@ public static class SmallTableGenerationParameters {
     public static enum ValueOption {
       NO_RESTRICTION,
       ONLY_ONE,
-      NO_REGULAR_SMALL_KEYS
+      NO_REGULAR_SMALL_KEYS,
+      EMPTY_VALUE, // Generate empty value entries.

Review Comment:
   By going over the code, I get the impression that the `ValueOption` 
enumeration is about the generation of the keys of the small table **not** the 
values. Mixing the two creates confusion and makes the code harder to 
understand.
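   A possible shape for the separation the reviewer is asking for (purely illustrative — these enum names are hypothetical, not a proposal from the PR): keep key-generation and value-generation concerns in distinct enums so a test description holds one of each, unambiguously.

   ```java
   // Hypothetical sketch: split the overloaded ValueOption enumeration into
   // one enum for small-table *key* generation and one for *value* generation.
   public class GenerationOptions {

       // Options that constrain how small-table keys are generated.
       enum KeyOption {
           NO_RESTRICTION,
           ONLY_ONE,
           NO_REGULAR_SMALL_KEYS
       }

       // A separate concern: how small-table values are generated.
       enum ValueOption {
           NO_RESTRICTION,
           EMPTY_VALUE  // generate empty value entries
       }

       public static void main(String[] args) {
           // A test description would then carry one option of each kind.
           KeyOption key = KeyOption.NO_RESTRICTION;
           ValueOption value = ValueOption.EMPTY_VALUE;
           System.out.println(key + " / " + value);
       }
   }
   ```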



##########
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinCommonOperator.java:
##########
@@ -596,6 +603,16 @@ protected void initializeOp(Configuration hconf) throws HiveException {
       outerSmallTableKeyVectorCopy.init(outerSmallTableKeyMapping);
     }
 
+    // Checks if the plan is projecting values from the SmallTable
+    // AND if the number of projected values matches the number of join keys.
+    if (smallTableValueMapping.getCount() > 0 &&
+        smallTableValueMapping.getCount() == bigTableKeyColumnMap.length) {

Review Comment:
   I don't understand the reasoning/intuition behind this check. Why do we care 
about values and keys being of the same length?



##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/TestMapJoinOperator.java:
##########
@@ -1343,6 +1349,93 @@ public void testString2() throws Exception {
     } while (!hiveConfVariationsDone);
   }
 
+  /**
+   * Test case for INNER vector map join with only small table key projection - string type.
+   *
+   * @throws Exception Exception
+   */
+  @Test
+  public void testSmallTableKeyOnlyProjectionWithEmptyValueString() throws Exception {

Review Comment:
   Adding tests in this class is useful, but it may not be the best option in every situation. These tests depend on random generation of input/output, and they are good for covering the general behavior of the join operators, but for edge cases and very specific bugs, having a fixed input, output, and join configuration would be much easier to reason about.
   
   For showcasing the bug in this PR (if there is one), it would really help to have a dedicated test case, possibly in another class, with well-defined and minimal input/output and join settings. Then we can discuss whether we also need these randomized tests.
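   A minimal fixed-input test along these lines could look like this (a simplified sketch that stands in for the real operator — the class and method names are hypothetical, and the join is modeled with plain collections). The point is that the expected output is written out by hand, so the test does not re-implement the production copy logic:

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;

   // Hypothetical sketch of a minimal, fixed-input test for the empty-value
   // case: a tiny inner join over hard-coded rows, with the expected output
   // spelled out by hand rather than derived from the implementation.
   public class EmptyValueJoinSketch {

       // Inner join where the small table stores keys but empty values:
       // the output row carries only the (big-table) key.
       static List<String> innerJoin(List<String> bigTableKeys, Set<String> smallTableKeys) {
           List<String> result = new ArrayList<>();
           for (String key : bigTableKeys) {
               if (smallTableKeys.contains(key)) {
                   result.add(key);
               }
           }
           return result;
       }

       public static void main(String[] args) {
           List<String> big = Arrays.asList("a", "b", "c", "b");
           Set<String> small = new HashSet<>(Arrays.asList("b", "c"));
           // Expected output is fixed and human-verifiable: "a" has no match,
           // every other big-table row joins on its key.
           System.out.println(innerJoin(big, small));
       }
   }
   ```

   With fixed inputs like these, a regression in the empty-value copy path shows up as a concrete, readable diff in the expected rows, instead of a failure in a randomized multiset comparison.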



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

