zabetak commented on code in PR #6165:
URL: https://github.com/apache/hive/pull/6165#discussion_r2549975690
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/MapJoinTestData.java:
##########
@@ -297,14 +297,25 @@ public static void generateVariationData(MapJoinTestData
testData,
private static RowTestObjects
generateRandomSmallTableValueRow(MapJoinTestDescription testDesc, Random
random) {
+ final ValueOption valueOption =
testDesc.getSmallTableGenerationParameters().getValueOption();
final int columnCount = testDesc.smallTableValueTypeInfos.length;
PrimitiveTypeInfo[] primitiveTypeInfos = new
PrimitiveTypeInfo[columnCount];
for (int i = 0; i < columnCount; i++) {
primitiveTypeInfos[i] = (PrimitiveTypeInfo)
testDesc.smallTableValueTypeInfos[i];
}
- Object[] smallTableValueRow =
- VectorRandomRowSource.randomWritablePrimitiveRow(
- columnCount, random, primitiveTypeInfos);
+ Object[] smallTableValueRow;
+
+ // Generate empty value row if specified.
+ if (valueOption == ValueOption.EMPTY_VALUE) {
+ smallTableValueRow = new Object[columnCount];
+ for (int c = 0; c < columnCount; c++) {
+ smallTableValueRow[c] =
+
VectorizedBatchUtil.getPrimitiveWritable(primitiveTypeInfos[c].getPrimitiveCategory());
Review Comment:
Which real use-case is this code trying to simulate? Is this equivalent to
having `null` values in the data or something else. Basically, I am trying to
understand under what circumstances we have these values in the table.
From a quick look, it seems that we are using special values (e.g., `new
Text(ArrayUtils.EMPTY_BYTE_ARRAY)`) but not really nulls. If that's the case
then I don't see why we need to handle this separately from
`VectorRandomRowSource.randomWritablePrimitiveRow`. Wouldn't it make more sense
to tune the random generator to occasioanlly generate this "special" values if
they can really appear in practice?
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinGenerateResultOperator.java:
##########
@@ -226,6 +223,29 @@ protected int
generateHashMapResultSingleValue(VectorizedRowBatch batch,
return numSel;
}
+ private void updateBatchWithSmallTableValue(
+ VectorizedRowBatch batch, VectorMapJoinHashMapResult hashMapResult, int
batchIndex, ByteSegmentRef byteSegmentRef)
+ throws HiveException {
+
+ // Check if the small table value is empty.
+ boolean isSmallTableValueEmpty = byteSegmentRef.getLength() == 0;
Review Comment:
The fact that we need to check the actual data in order to decide how to
evaluate the join (or rather the creation of the resulting row) is somewhat
suspicious and a bit brittle. Ideally, the compiler should be able to determine
exactly how the operator should behave via the query plan. Can we exploit (or
add) information in the query plan in order to drive the copy decision below?
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/TestMapJoinOperator.java:
##########
@@ -1512,8 +1605,17 @@ private RowTestObjectsMultiSet
createExpectedTestRowMultiSet(MapJoinTestDescript
Object[] valueRow = valueList.get(v).getRow();
final int smallTableRetainValueColumnNumsLength =
testDesc.smallTableRetainValueColumnNums.length;
+
+ // When EMPTY_VALUE is specified, the small table value
columns are not
+ // actually stored in the small table HashMap. So we have to
simulate that here.
+ // The expected output should have values from the big table
key columns.
+ final boolean isEmptyValue =
+ testDesc.smallTableGenerationParameters.getValueOption()
== ValueOption.EMPTY_VALUE &&
+ testDesc.smallTableRetainValueColumnNums.length > 0 &&
+ testDesc.smallTableRetainValueColumnNums.length ==
testDesc.bigTableKeyColumnNums.length;
Review Comment:
This simulation is problematic cause it makes the solution and the test code
somewhat identical. We're implementing a copy logic in two places (prod & test)
so the tests will trivially pass as it is right now and immediately fail if the
implementation changes in the future.
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/MapJoinTestDescription.java:
##########
@@ -44,7 +45,8 @@ public static class SmallTableGenerationParameters {
public static enum ValueOption {
NO_RESTRICTION,
ONLY_ONE,
- NO_REGULAR_SMALL_KEYS
+ NO_REGULAR_SMALL_KEYS,
+ EMPTY_VALUE, // Generate empty value entries.
Review Comment:
By going over the code, I get the impression that the `ValueOption`
enumeration is about the generation of the keys of the small table **not** the
values. Mixing the two creates confusion and makes the code harder to
understand.
##########
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinCommonOperator.java:
##########
@@ -596,6 +603,16 @@ protected void initializeOp(Configuration hconf) throws
HiveException {
outerSmallTableKeyVectorCopy.init(outerSmallTableKeyMapping);
}
+ // Checks if the plan is projecting values from the SmallTable
+ // AND if the number of projected values matches the number of join keys.
+ if (smallTableValueMapping.getCount() > 0 &&
+ smallTableValueMapping.getCount() == bigTableKeyColumnMap.length) {
Review Comment:
I don't understand the reasoning/intuition behind this check. Why do we care
about values and keys being of the same length?
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/TestMapJoinOperator.java:
##########
@@ -1343,6 +1349,93 @@ public void testString2() throws Exception {
} while (!hiveConfVariationsDone);
}
+ /**
+ * Test case for INNER vector map join with only small table key projection
- string type.
+ *
+ * @throws Exception Exception
+ */
+ @Test
+ public void testSmallTableKeyOnlyProjectionWithEmptyValueString() throws
Exception {
Review Comment:
Adding tests in this class are useful but it may not be the best option in
every situation. These tests depend on random generation of input/output and
they are good for covering general behavior of the joins operators but for edge
cases and very specific bugs having fixed input & output and join configuration
would be much easier to reason about.
For showcasing the bug in this PR (if there is one), it would really help to
have a dedicated test case possibly in another class and have well-defined and
minimal input/output and join settings. Then we can discuss if we also need
these randomized tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]