zabetak commented on code in PR #5543:
URL: https://github.com/apache/hive/pull/5543#discussion_r1854090784
##########
ql/src/test/queries/clientpositive/vectorized_murmurhash_repeating.q:
##########
@@ -0,0 +1,29 @@
+set hive.llap.io.enabled=false;
+set hive.vectorized.execution.enabled=true;
+
+CREATE TABLE source(id string, id2 string) stored as orc;
+LOAD DATA LOCAL INPATH '../../data/files/2048.orc' overwrite into table source;
+
+-- 2048.orc has 2 columns (id, id2) and 2048 rows.
+-- For the first 1024 rows, both id and id2 are not repeating.
+-- For the last 1024 rows, id is repeating while id2 is not.
+
+-- Test MurmurHashStringColStringCol
+explain
+select sum(murmur_hash(id, id2)) from source;
Review Comment:
Is the `sum` function necessary for reproducing the problem?
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestMurmurHashExpression.java:
##########
@@ -241,4 +241,143 @@ public void testMurmurHashWithNullsInt() throws HiveException {
((LongColumnVector) vrb.cols[2]).vector[i]);
}
}
+
+ @Test
+ public void testMurmurHashStringColStringColSingleRepeating() throws HiveException {
+ BytesColumnVector cvString1 = (BytesColumnVector) ColumnVectorGenUtil.generateColumnVector(
+ TypeInfoFactory.getPrimitiveTypeInfo("string"), false, false, SIZE, rand);
+
+ byte[] repeatedValue = new byte[10];
+ rand.nextBytes(repeatedValue);
Review Comment:
Random data makes the tests harder to debug and less predictable, so if we can
avoid it that would be better.
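For instance, a fixed literal would keep the test deterministic (just a sketch; the literal value is arbitrary):

```java
// Deterministic 10-byte value instead of rand.nextBytes(repeatedValue);
// any fixed 10-byte ASCII string would do.
byte[] repeatedValue = "repeated!!".getBytes(StandardCharsets.UTF_8);
```

(assuming `java.nio.charset.StandardCharsets` is imported in the test class)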
##########
ql/src/test/queries/clientpositive/vectorized_murmurhash_repeating.q:
##########
@@ -0,0 +1,29 @@
+set hive.llap.io.enabled=false;
+set hive.vectorized.execution.enabled=true;
+
+CREATE TABLE source(id string, id2 string) stored as orc;
+LOAD DATA LOCAL INPATH '../../data/files/2048.orc' overwrite into table source;
+
+-- 2048.orc has 2 columns (id, id2) and 2048 rows.
+-- For the first 1024 rows, both id and id2 are not repeating.
+-- For the last 1024 rows, id is repeating while id2 is not.
+
+-- Test MurmurHashStringColStringCol
+explain
Review Comment:
The plain `explain` is not very useful in this case since it doesn't tell us
whether (and which) murmur implementation is used. `EXPLAIN VECTORIZATION
DETAIL` should show all the necessary info.
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestMurmurHashExpression.java:
##########
@@ -241,4 +241,143 @@ public void testMurmurHashWithNullsInt() throws HiveException {
((LongColumnVector) vrb.cols[2]).vector[i]);
}
}
+
+ @Test
Review Comment:
There seems to be a bit of duplication in the tests. Is there anything we can
factor out?
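For instance, the batch wiring that repeats in each new test could go into a small helper (a sketch; the helper name and signature are just a suggestion):

```java
// Builds the 3-column batch used by all these tests:
// two string input columns plus a long output column for the hash.
private VectorizedRowBatch makeBatch(BytesColumnVector in1, BytesColumnVector in2) {
  VectorizedRowBatch vrb = new VectorizedRowBatch(3, SIZE);
  vrb.cols[0] = in1;
  vrb.cols[1] = in2;
  vrb.cols[2] = new LongColumnVector(SIZE);
  return vrb;
}
```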
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestMurmurHashExpression.java:
##########
@@ -241,4 +241,143 @@ public void testMurmurHashWithNullsInt() throws HiveException {
((LongColumnVector) vrb.cols[2]).vector[i]);
}
}
+
+ @Test
+ public void testMurmurHashStringColStringColSingleRepeating() throws HiveException {
+ BytesColumnVector cvString1 = (BytesColumnVector) ColumnVectorGenUtil.generateColumnVector(
+ TypeInfoFactory.getPrimitiveTypeInfo("string"), false, false, SIZE, rand);
+
+ byte[] repeatedValue = new byte[10];
+ rand.nextBytes(repeatedValue);
+
+ BytesColumnVector cvString2 = new BytesColumnVector(SIZE);
+ cvString2.initBuffer(10);
+ cvString2.noNulls = false;
+ cvString2.isRepeating = true;
+ cvString2.setRef(0, repeatedValue, 0, 10);
+
+ for (int i = 1; i < SIZE; i++) {
+ cvString2.length[i] = 10;
+ }
+
+ VectorizedRowBatch vrb = new VectorizedRowBatch(3, SIZE);
+ vrb.cols[0] = cvString1;
+ vrb.cols[1] = cvString2;
+ vrb.cols[2] = new LongColumnVector(SIZE);
+
+ new MurmurHashStringColStringCol(0, 1, 2).evaluate(vrb);
+
+ Assert.assertEquals(false, vrb.cols[2].isRepeating);
+
+ Text t2 = new Text(repeatedValue);
+ for (int i = 0; i < SIZE; i++) {
+ Text t1 = new Text();
+ t1.set(cvString1.vector[i], cvString1.start[i], cvString1.length[i]);
+
+ Assert.assertEquals(
+ ObjectInspectorUtils.getBucketHashCode(
Review Comment:
What exactly are we trying to achieve with these assertions? They seem a bit
useless since the implementation of the function internally calls this very method.
If we want to check the correctness of the method then we should hardcode the
expected values somewhere and use those in the asserts.
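For example, with deterministic inputs the expected hashes could be computed once and inlined (a sketch; the literals below are placeholders to illustrate the shape, not real hash values):

```java
// Expected murmur hashes precomputed from the fixed inputs;
// the numbers here are placeholders, not actual values.
long[] expected = { 1374925421L, -1030286586L /* , ... */ };
for (int i = 0; i < expected.length; i++) {
  Assert.assertEquals(expected[i], ((LongColumnVector) vrb.cols[2]).vector[i]);
}
```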
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestMurmurHashExpression.java:
##########
@@ -241,4 +241,143 @@ public void testMurmurHashWithNullsInt() throws HiveException {
((LongColumnVector) vrb.cols[2]).vector[i]);
}
}
+
+ @Test
+ public void testMurmurHashStringColStringColSingleRepeating() throws HiveException {
+ BytesColumnVector cvString1 = (BytesColumnVector) ColumnVectorGenUtil.generateColumnVector(
+ TypeInfoFactory.getPrimitiveTypeInfo("string"), false, false, SIZE, rand);
+
+ byte[] repeatedValue = new byte[10];
+ rand.nextBytes(repeatedValue);
+
+ BytesColumnVector cvString2 = new BytesColumnVector(SIZE);
+ cvString2.initBuffer(10);
+ cvString2.noNulls = false;
+ cvString2.isRepeating = true;
+ cvString2.setRef(0, repeatedValue, 0, 10);
+
+ for (int i = 1; i < SIZE; i++) {
+ cvString2.length[i] = 10;
+ }
+
+ VectorizedRowBatch vrb = new VectorizedRowBatch(3, SIZE);
+ vrb.cols[0] = cvString1;
+ vrb.cols[1] = cvString2;
+ vrb.cols[2] = new LongColumnVector(SIZE);
+
+ new MurmurHashStringColStringCol(0, 1, 2).evaluate(vrb);
+
+ Assert.assertEquals(false, vrb.cols[2].isRepeating);
Review Comment:
Why do we have this assertion? Is it something that we want to enforce?
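If the point is that the output must not be flagged as repeating when only one input column repeats, asserting on the values might state the intent more directly (a sketch):

```java
// With distinct values in cvString1, the row hashes should differ,
// so the output column cannot legitimately be marked repeating.
LongColumnVector out = (LongColumnVector) vrb.cols[2];
Assert.assertFalse(out.isRepeating);
Assert.assertNotEquals(out.vector[0], out.vector[1]);
```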
##########
ql/src/test/queries/clientpositive/vectorized_murmurhash_repeating.q:
##########
@@ -0,0 +1,29 @@
+set hive.llap.io.enabled=false;
+set hive.vectorized.execution.enabled=true;
+
+CREATE TABLE source(id string, id2 string) stored as orc;
+LOAD DATA LOCAL INPATH '../../data/files/2048.orc' overwrite into table source;
Review Comment:
The .orc files are binary and cannot be easily inspected/reviewed.
Moreover, the ASF imposes various limitations on the content of binary files,
so it is good to avoid them whenever we can.
Is it possible to use a .csv file as input and a staging TEXT table for
loading the data into the ORC table? If that's not possible then it's fine; we
can merge the PR as is.
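For example, something along these lines (just a sketch; the .csv file name and delimiter are assumptions):

```sql
-- Hypothetical staging flow: load reviewable text data, then convert to ORC.
CREATE TABLE staging(id string, id2 string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '../../data/files/2048.csv' OVERWRITE INTO TABLE staging;

CREATE TABLE source(id string, id2 string) STORED AS ORC;
INSERT OVERWRITE TABLE source SELECT * FROM staging;
```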
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]