zabetak commented on code in PR #5543:
URL: https://github.com/apache/hive/pull/5543#discussion_r1854090784
##########
ql/src/test/queries/clientpositive/vectorized_murmurhash_repeating.q:
##########
@@ -0,0 +1,29 @@
+set hive.llap.io.enabled=false;
+set hive.vectorized.execution.enabled=true;
+
+CREATE TABLE source(id string, id2 string) stored as orc;
+LOAD DATA LOCAL INPATH '../../data/files/2048.orc' overwrite into table source;
+
+-- 2048.orc has 2 columns (id, id2) and 2048 rows.
+-- For the first 1024 rows, both id and id2 are not repeating.
+-- For the last 1024 rows, id is repeating while id2 is not.
+
+-- Test MurmurHashStringColStringCol
+explain
+select sum(murmur_hash(id, id2)) from source;
Review Comment:
Is the `sum` function necessary for reproducing the problem?
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestMurmurHashExpression.java:
##########
@@ -241,4 +241,143 @@ public void testMurmurHashWithNullsInt() throws HiveException {
((LongColumnVector) vrb.cols[2]).vector[i]);
}
}
+
+ @Test
+ public void testMurmurHashStringColStringColSingleRepeating() throws HiveException {
+ BytesColumnVector cvString1 = (BytesColumnVector) ColumnVectorGenUtil.generateColumnVector(
+ TypeInfoFactory.getPrimitiveTypeInfo("string"), false, false, SIZE, rand);
+
+ byte[] repeatedValue = new byte[10];
+ rand.nextBytes(repeatedValue);
Review Comment:
Random data makes the tests harder to debug and less predictable, so if we can
avoid it that would be better.
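For instance, a fixed literal would keep the test deterministic (just a sketch; the literal value is arbitrary):

```java
// Deterministic 10-byte value instead of rand.nextBytes(repeatedValue);
// any fixed 10-byte ASCII string would do.
byte[] repeatedValue = "repeated!!".getBytes(StandardCharsets.UTF_8);
```

(assuming `java.nio.charset.StandardCharsets` is imported in the test class)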
##########
ql/src/test/queries/clientpositive/vectorized_murmurhash_repeating.q:
##########
@@ -0,0 +1,29 @@
+set hive.llap.io.enabled=false;
+set hive.vectorized.execution.enabled=true;
+
+CREATE TABLE source(id string, id2 string) stored as orc;
+LOAD DATA LOCAL INPATH '../../data/files/2048.orc' overwrite into table source;
+
+-- 2048.orc has 2 columns (id, id2) and 2048 rows.
+-- For the first 1024 rows, both id and id2 are not repeating.
+-- For the last 1024 rows, id is repeating while id2 is not.
+
+-- Test MurmurHashStringColStringCol
+explain
Review Comment:
The plain `explain` is not very useful in this case since it doesn't tell us
whether (and which) murmur implementation is used. `EXPLAIN VECTORIZATION
DETAIL` should show all the necessary info.
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestMurmurHashExpression.java:
##########
@@ -241,4 +241,143 @@ public void testMurmurHashWithNullsInt() throws HiveException {
((LongColumnVector) vrb.cols[2]).vector[i]);
}
}
+
+ @Test
Review Comment:
There seems to be a bit of duplication in the tests. Is there anything we can
factor out?
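For instance, the batch wiring that repeats in each new test could go into a small helper (a sketch; the helper name and signature are just a suggestion):

```java
// Builds the 3-column batch used by all these tests:
// two string input columns plus a long output column for the hash.
private VectorizedRowBatch makeBatch(BytesColumnVector in1, BytesColumnVector in2) {
  VectorizedRowBatch vrb = new VectorizedRowBatch(3, SIZE);
  vrb.cols[0] = in1;
  vrb.cols[1] = in2;
  vrb.cols[2] = new LongColumnVector(SIZE);
  return vrb;
}
```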
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestMurmurHashExpression.java:
##########
@@ -241,4 +241,143 @@ public void testMurmurHashWithNullsInt() throws HiveException {
((LongColumnVector) vrb.cols[2]).vector[i]);
}
}
+
+ @Test
+ public void testMurmurHashStringColStringColSingleRepeating() throws HiveException {
+ BytesColumnVector cvString1 = (BytesColumnVector) ColumnVectorGenUtil.generateColumnVector(
+ TypeInfoFactory.getPrimitiveTypeInfo("string"), false, false, SIZE, rand);
+
+ byte[] repeatedValue = new byte[10];
+ rand.nextBytes(repeatedValue);
+
+ BytesColumnVector cvString2 = new BytesColumnVector(SIZE);
+ cvString2.initBuffer(10);
+ cvString2.noNulls = false;
+ cvString2.isRepeating = true;
+ cvString2.setRef(0, repeatedValue, 0, 10);
+
+ for (int i = 1; i < SIZE; i++) {
+ cvString2.length[i] = 10;
+ }
+
+ VectorizedRowBatch vrb = new VectorizedRowBatch(3, SIZE);
+ vrb.cols[0] = cvString1;
+ vrb.cols[1] = cvString2;
+ vrb.cols[2] = new LongColumnVector(SIZE);
+
+ new MurmurHashStringColStringCol(0, 1, 2).evaluate(vrb);
+
+ Assert.assertEquals(false, vrb.cols[2].isRepeating);
+
+ Text t2 = new Text(repeatedValue);
+ for (int i = 0; i < SIZE; i++) {
+ Text t1 = new Text();
+ t1.set(cvString1.vector[i], cvString1.start[i], cvString1.length[i]);
+
+ Assert.assertEquals(
+ ObjectInspectorUtils.getBucketHashCode(
Review Comment:
What exactly are we trying to achieve with these assertions? They seem a bit
useless since the implementation of the function internally calls this very method.
If we want to check the correctness of the method then we should hardcode the
expected values somewhere and use those in the asserts.
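For example, with deterministic inputs the expected hashes could be computed once and inlined (a sketch; the literals below are placeholders to illustrate the shape, not real hash values):

```java
// Expected murmur hashes precomputed from the fixed inputs;
// the numbers here are placeholders, not actual values.
long[] expected = { 1374925421L, -1030286586L /* , ... */ };
for (int i = 0; i < expected.length; i++) {
  Assert.assertEquals(expected[i], ((LongColumnVector) vrb.cols[2]).vector[i]);
}
```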
##########
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestMurmurHashExpression.java:
##########
@@ -241,4 +241,143 @@ public void testMurmurHashWithNullsInt() throws HiveException {
((LongColumnVector) vrb.cols[2]).vector[i]);
}
}
+
+ @Test
+ public void testMurmurHashStringColStringColSingleRepeating() throws HiveException {
+ BytesColumnVector cvString1 = (BytesColumnVector) ColumnVectorGenUtil.generateColumnVector(
+ TypeInfoFactory.getPrimitiveTypeInfo("string"), false, false, SIZE, rand);
+
+ byte[] repeatedValue = new byte[10];
+ rand.nextBytes(repeatedValue);
+
+ BytesColumnVector cvString2 = new BytesColumnVector(SIZE);
+ cvString2.initBuffer(10);
+ cvString2.noNulls = false;
+ cvString2.isRepeating = true;
+ cvString2.setRef(0, repeatedValue, 0, 10);
+
+ for (int i = 1; i < SIZE; i++) {
+ cvString2.length[i] = 10;
+ }
+
+ VectorizedRowBatch vrb = new VectorizedRowBatch(3, SIZE);
+ vrb.cols[0] = cvString1;
+ vrb.cols[1] = cvString2;
+ vrb.cols[2] = new LongColumnVector(SIZE);
+
+ new MurmurHashStringColStringCol(0, 1, 2).evaluate(vrb);
+
+ Assert.assertEquals(false, vrb.cols[2].isRepeating);
Review Comment:
Why do we have this assertion? Is it something that we want to enforce?
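If the point is that the output must not be flagged as repeating when only one input column repeats, asserting on the values might state the intent more directly (a sketch):

```java
// With distinct values in cvString1, the row hashes should differ,
// so the output column cannot legitimately be marked repeating.
LongColumnVector out = (LongColumnVector) vrb.cols[2];
Assert.assertFalse(out.isRepeating);
Assert.assertNotEquals(out.vector[0], out.vector[1]);
```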
##########
ql/src/test/queries/clientpositive/vectorized_murmurhash_repeating.q:
##########
@@ -0,0 +1,29 @@
+set hive.llap.io.enabled=false;
+set hive.vectorized.execution.enabled=true;
+
+CREATE TABLE source(id string, id2 string) stored as orc;
+LOAD DATA LOCAL INPATH '../../data/files/2048.orc' overwrite into table source;
Review Comment:
The .orc files are binary and cannot be easily inspected/reviewed.
Moreover, the ASF imposes various limitations on the content of binary files,
so it is good to avoid them whenever we can.
Is it possible to use a .csv file as input and a staging TEXT table for
loading the data into the ORC table? If that's not possible then it's fine; we
can merge the PR as is.
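For example, something along these lines (just a sketch; the .csv file name and delimiter are assumptions):

```sql
-- Hypothetical staging flow: load reviewable text data, then convert to ORC.
CREATE TABLE staging(id string, id2 string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '../../data/files/2048.csv' OVERWRITE INTO TABLE staging;

CREATE TABLE source(id string, id2 string) STORED AS ORC;
INSERT OVERWRITE TABLE source SELECT * FROM staging;
```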
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]