[GitHub] [spark] viirya commented on a change in pull request #33330: [SPARK-36123][SQL] Parquet vectorized reader doesn't skip null values correctly

2021-07-14 Thread GitBox


viirya commented on a change in pull request #33330:
URL: https://github.com/apache/spark/pull/33330#discussion_r669323359



##
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##
@@ -255,6 +253,39 @@ private void readBatchInternal(
     state.advanceOffsetAndRowId(offset, rowId);
   }

+  /**
+   * Skip the next `n` values (either null or non-null) from this definition level reader and
+   * `valueReader`.
+   */
+  private void skipValues(
+      int n,
+      ParquetReadState state,
+      VectorizedValuesReader valuesReader,
+      ParquetVectorUpdater updater) {
+    while (n > 0) {
+      if (this.currentCount == 0) this.readNextGroup();
+      int num = Math.min(n, this.currentCount);
+      switch (mode) {
+        case RLE:
+          // we only need to skip non-null values from `valuesReader` since nulls are represented
+          // via definition levels which are skipped here via decrementing `currentCount`.
+          if (currentValue == state.maxDefinitionLevel) {
+            updater.skipValues(num, valuesReader);
+          }
+          break;
+        case PACKED:
+          for (int i = 0; i < num; ++i) {
+            // save as above, only skip non-null values from `valuesReader`

Review comment:
   save? same?
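
For context on the `skipValues` logic quoted above, here is a minimal, self-contained Java sketch of the idea under review: nulls exist only as definition levels, so skipping `n` values consumes `n` definition levels while only the non-null ones advance the data reader. The class `DefinitionLevelSkipSketch` and its method are hypothetical illustrations, not Spark code.

// Hypothetical sketch (not Spark code): skip `n` values given their definition levels.
// Only levels equal to maxDefinitionLevel correspond to entries in the data pages.
final class DefinitionLevelSkipSketch {
  static int nonNullValuesToSkip(int[] definitionLevels, int start, int n, int maxDefinitionLevel) {
    int nonNull = 0;
    for (int i = start; i < start + n; i++) {
      if (definitionLevels[i] == maxDefinitionLevel) {
        nonNull++;  // present in the data pages, so it must be skipped in the values reader
      }
      // otherwise: a null, represented only by its definition level, nothing to skip
    }
    return nonNull;
  }

  public static void main(String[] args) {
    int[] levels = {1, 0, 1, 1, 0};  // 1 = non-null, 0 = null (maxDefinitionLevel = 1)
    // Skipping all 5 values should skip only the 3 non-null entries from the data reader.
    System.out.println(nonNullValuesToSkip(levels, 0, 5, 1));  // prints 3
  }
}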







[GitHub] [spark] viirya commented on a change in pull request #33330: [SPARK-36123][SQL] Parquet vectorized reader doesn't skip null values correctly

2021-07-13 Thread GitBox


viirya commented on a change in pull request #33330:
URL: https://github.com/apache/spark/pull/33330#discussion_r669311526



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnIndexSuite.scala
##
@@ -31,24 +31,25 @@ class ParquetColumnIndexSuite extends QueryTest with ParquetTest with SharedSpar
 *  |---|-|-|---|---|---|---|---|
 * col_2   400   300   200 200 200 200 200 200

Review comment:
   And also the PR description. :)







[GitHub] [spark] viirya commented on a change in pull request #33330: [SPARK-36123][SQL] Parquet vectorized reader doesn't skip null values correctly

2021-07-13 Thread GitBox


viirya commented on a change in pull request #33330:
URL: https://github.com/apache/spark/pull/33330#discussion_r669310575



##
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##
@@ -358,7 +386,14 @@ public void skipIntegers(int total) {
     while (left > 0) {
       if (this.currentCount == 0) this.readNextGroup();
       int n = Math.min(left, this.currentCount);
-      advance(n);
+      switch (mode) {
+        case RLE:
+          break;
+        case PACKED:
+          currentBufferIdx += n;
+          break;
+      }
+      currentCount -= n;

Review comment:
   Did you remove the `advance()` method? I don't see it removed here.
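
For reference, a self-contained sketch of what the inlined replacement for `advance(n)` amounts to. The field and enum names mirror `VectorizedRleValuesReader`, but the class `GroupSkipSketch` is hypothetical and not the Spark implementation.

// Hypothetical sketch (not Spark code): consuming `n` levels from the current group.
// An RLE run only needs its remaining count decremented, while the PACKED (bit-packed)
// case must also move the index into the unpacked buffer.
final class GroupSkipSketch {
  enum Mode { RLE, PACKED }

  Mode mode = Mode.PACKED;
  int currentCount = 8;      // values remaining in the current run/group
  int currentBufferIdx = 0;  // read position inside the unpacked PACKED buffer

  void skipFromCurrentGroup(int n) {
    switch (mode) {
      case RLE:
        break;                  // the whole run shares one value; nothing to move
      case PACKED:
        currentBufferIdx += n;  // step over the unpacked levels
        break;
    }
    currentCount -= n;
  }

  public static void main(String[] args) {
    GroupSkipSketch s = new GroupSkipSketch();
    s.skipFromCurrentGroup(3);
    System.out.println(s.currentCount + " " + s.currentBufferIdx);  // prints "5 3"
  }
}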







[GitHub] [spark] viirya commented on a change in pull request #33330: [SPARK-36123][SQL] Parquet vectorized reader doesn't skip null values correctly

2021-07-13 Thread GitBox


viirya commented on a change in pull request #33330:
URL: https://github.com/apache/spark/pull/33330#discussion_r669309094



##
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##
@@ -358,7 +386,14 @@ public void skipIntegers(int total) {
     while (left > 0) {
       if (this.currentCount == 0) this.readNextGroup();
       int n = Math.min(left, this.currentCount);
-      advance(n);
+      switch (mode) {
+        case RLE:
+          break;
+        case PACKED:
+          currentBufferIdx += n;
+          break;
+      }
+      currentCount -= n;

Review comment:
   Hmm, is this an old bug?







[GitHub] [spark] viirya commented on a change in pull request #33330: [SPARK-36123][SQL] Parquet vectorized reader doesn't skip null values correctly

2021-07-13 Thread GitBox


viirya commented on a change in pull request #33330:
URL: https://github.com/apache/spark/pull/33330#discussion_r669308507



##
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##
@@ -255,6 +253,36 @@ private void readBatchInternal(
     state.advanceOffsetAndRowId(offset, rowId);
   }

+  /**
+   * Skip the next `n` values (either null or non-null) from this definition level reader and
+   * `valueReader`.
+   */
+  private void skipValues(
+      int n,
+      ParquetReadState state,
+      VectorizedValuesReader valuesReader,
+      ParquetVectorUpdater updater) {
+    while (n > 0) {
+      if (this.currentCount == 0) this.readNextGroup();
+      int num = Math.min(n, this.currentCount);
+      switch (mode) {
+        case RLE:
+          if (currentValue == state.maxDefinitionLevel) {
+            updater.skipValues(num, valuesReader);
+          }
+          break;
+        case PACKED:
+          for (int i = 0; i < num; ++i) {
+            if (currentBuffer[currentBufferIdx++] == state.maxDefinitionLevel) {
+              updater.skipValues(1, valuesReader);
+            }

Review comment:
   Ditto.
   
   If so, maybe it would be good to add a comment.







[GitHub] [spark] viirya commented on a change in pull request #33330: [SPARK-36123][SQL] Parquet vectorized reader doesn't skip null values correctly

2021-07-13 Thread GitBox


viirya commented on a change in pull request #33330:
URL: https://github.com/apache/spark/pull/33330#discussion_r669308360



##
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##
@@ -255,6 +253,36 @@ private void readBatchInternal(
     state.advanceOffsetAndRowId(offset, rowId);
   }

+  /**
+   * Skip the next `n` values (either null or non-null) from this definition level reader and
+   * `valueReader`.
+   */
+  private void skipValues(
+      int n,
+      ParquetReadState state,
+      VectorizedValuesReader valuesReader,
+      ParquetVectorUpdater updater) {
+    while (n > 0) {
+      if (this.currentCount == 0) this.readNextGroup();
+      int num = Math.min(n, this.currentCount);
+      switch (mode) {
+        case RLE:
+          if (currentValue == state.maxDefinitionLevel) {
+            updater.skipValues(num, valuesReader);
+          }

Review comment:
   Is the else case for null values?
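
For what it's worth, the comment added in the later revision of this diff (quoted at the top of this thread) indicates that nulls are represented via definition levels only, so the implicit else branch has nothing to skip from `valuesReader`. A hypothetical, self-contained sketch of that per-run accounting follows; `RleRunSkipSketch` is illustrative, not Spark code.

// Hypothetical sketch (not Spark code): how many data-page entries an RLE run of
// definition levels accounts for. A run at maxDefinitionLevel is all non-null values;
// any lower level means the run encodes nulls, which occupy no space in the data pages.
final class RleRunSkipSketch {
  static int dataValuesInRun(int runLevel, int runLength, int maxDefinitionLevel) {
    return (runLevel == maxDefinitionLevel) ? runLength : 0;
  }

  public static void main(String[] args) {
    System.out.println(dataValuesInRun(1, 100, 1));  // prints 100 (non-null run: skip all)
    System.out.println(dataValuesInRun(0, 100, 1));  // prints 0 (null run: nothing to skip)
  }
}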







[GitHub] [spark] viirya commented on a change in pull request #33330: [SPARK-36123][SQL] Parquet vectorized reader doesn't skip null values correctly

2021-07-13 Thread GitBox


viirya commented on a change in pull request #33330:
URL: https://github.com/apache/spark/pull/33330#discussion_r669306577



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnIndexSuite.scala
##
@@ -31,24 +31,25 @@ class ParquetColumnIndexSuite extends QueryTest with ParquetTest with SharedSpar
 *  |---|-|-|---|---|---|---|---|
 * col_2   400   300   200 200 200 200 200 200

Review comment:
   BTW, about the layout: should the totals of col_1 and col_2 both be 2000? It looks like col_2 only adds up to 1900.
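
For the arithmetic behind the question: summing the col_2 page sizes quoted above gives 1900, not 2000 (col_1's row is not visible in this excerpt, so its total is taken on faith from the surrounding doc comment). A trivial check:

// Summing the col_2 page sizes from the excerpt: 400 + 300 + 6 * 200 = 1900,
// which is the reviewer's point — it falls 100 short of 2000.
public class Col2PageTotal {
  public static void main(String[] args) {
    int[] col2Pages = {400, 300, 200, 200, 200, 200, 200, 200};
    int total = 0;
    for (int p : col2Pages) total += p;
    System.out.println(total);  // prints 1900
  }
}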



