[GitHub] [spark] LuciferYang commented on a change in pull request #34471: [SPARK-36879][SQL] Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path

2022-01-08 Thread GitBox


LuciferYang commented on a change in pull request #34471:
URL: https://github.com/apache/spark/pull/34471#discussion_r780024037



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
##
@@ -76,6 +79,7 @@ object DataSourceReadBenchmark extends SqlBasedBenchmark {
 saveAsCsvTable(testDf, dir.getCanonicalPath + "/csv")
 saveAsJsonTable(testDf, dir.getCanonicalPath + "/json")
 saveAsParquetTable(testDf, dir.getCanonicalPath + "/parquet")
+saveAsParquetV2Table(testDf, dir.getCanonicalPath + "/parquetV2")

Review comment:
   I found that there are still unsupported encodings in Data Page V2, such as RLE for Boolean, so it seems it is not yet time to update the benchmark. Please ignore my previous comments.
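
   For context, a minimal sketch of what a `saveAsParquetV2Table` helper could look like; this is an assumption for illustration only (the PR's actual helper is not included in this excerpt). The idea is to set `parquet.writer.version` to `PARQUET_2_0` so the writer emits Data Page V2 encodings such as DELTA_BINARY_PACKED for int/long columns:

   ```scala
   import org.apache.parquet.column.ParquetProperties
   import org.apache.parquet.hadoop.ParquetOutputFormat
   import org.apache.spark.sql.{DataFrame, SparkSession}

   // Hypothetical helper, for illustration only: writes the DataFrame with the
   // Parquet V2 writer and registers the result as a temp view for the benchmark.
   def saveAsParquetV2Table(spark: SparkSession, df: DataFrame, dir: String): Unit = {
     // ParquetOutputFormat.WRITER_VERSION is the Hadoop key "parquet.writer.version";
     // Spark copies SQL conf entries into the Hadoop conf used by the Parquet writer.
     spark.conf.set(ParquetOutputFormat.WRITER_VERSION,
       ParquetProperties.WriterVersion.PARQUET_2_0.toString)
     try {
       df.write.mode("overwrite").parquet(dir)
       spark.read.parquet(dir).createOrReplaceTempView("parquetV2Table")
     } finally {
       spark.conf.unset(ParquetOutputFormat.WRITER_VERSION)
     }
   }
   ```

   With the writer version set to v2, int and long columns default to DELTA_BINARY_PACKED, which is the encoding the new vectorized reader in this PR decodes.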
   
   










[GitHub] [spark] LuciferYang commented on a change in pull request #34471: [SPARK-36879][SQL] Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path

2022-01-06 Thread GitBox


LuciferYang commented on a change in pull request #34471:
URL: https://github.com/apache/spark/pull/34471#discussion_r779406568



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
##
@@ -76,6 +79,7 @@ object DataSourceReadBenchmark extends SqlBasedBenchmark {
 saveAsCsvTable(testDf, dir.getCanonicalPath + "/csv")
 saveAsJsonTable(testDf, dir.getCanonicalPath + "/json")
 saveAsParquetTable(testDf, dir.getCanonicalPath + "/parquet")
+saveAsParquetV2Table(testDf, dir.getCanonicalPath + "/parquetV2")

Review comment:
   ~~Maybe we should update the benchmark results of `DataSourceReadBenchmark`~~ @parthchandra







[GitHub] [spark] LuciferYang commented on a change in pull request #34471: [SPARK-36879][SQL] Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path

2022-01-06 Thread GitBox


LuciferYang commented on a change in pull request #34471:
URL: https://github.com/apache/spark/pull/34471#discussion_r780024037



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
##
@@ -76,6 +79,7 @@ object DataSourceReadBenchmark extends SqlBasedBenchmark {
 saveAsCsvTable(testDf, dir.getCanonicalPath + "/csv")
 saveAsJsonTable(testDf, dir.getCanonicalPath + "/json")
 saveAsParquetTable(testDf, dir.getCanonicalPath + "/parquet")
+saveAsParquetV2Table(testDf, dir.getCanonicalPath + "/parquetV2")

Review comment:
   I found that there are still unsupported encodings in Data Page V2, such as RLE, so it seems it is not yet time to update the benchmark. Please ignore my previous comments.
   
   







[GitHub] [spark] LuciferYang commented on a change in pull request #34471: [SPARK-36879][SQL] Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path

2022-01-06 Thread GitBox


LuciferYang commented on a change in pull request #34471:
URL: https://github.com/apache/spark/pull/34471#discussion_r779409193



##
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java
##
@@ -0,0 +1,319 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.io.IOException;
+import java.math.BigInteger;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.bytes.ByteBufferInputStream;
+import org.apache.parquet.bytes.BytesUtils;
+import org.apache.parquet.column.values.bitpacking.BytePackerForLong;
+import org.apache.parquet.column.values.bitpacking.Packer;
+import org.apache.parquet.io.ParquetDecodingException;
+import org.apache.spark.sql.catalyst.util.RebaseDateTime;
+import org.apache.spark.sql.execution.datasources.DataSourceUtils;
+import org.apache.spark.sql.execution.vectorized.WritableColumnVector;
+
+/**
+ * An implementation of the Parquet DELTA_BINARY_PACKED decoder that supports the vectorized
+ * interface. DELTA_BINARY_PACKED is a delta encoding for integer and long types that stores values
+ * as a delta between consecutive values. Delta values are themselves bit packed. Similar to RLE but
+ * is more effective in the case of large variation of values in the encoded column.
+ * <p>
+ * DELTA_BINARY_PACKED is the default encoding for integer and long columns in Parquet V2.
+ * <p>
+ * Supported Types: INT32, INT64
+ * <p>
+ *
+ * @see <a href="https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5">
+ * Parquet format encodings: DELTA_BINARY_PACKED</a>
+ */
+public class VectorizedDeltaBinaryPackedReader extends VectorizedReaderBase {
+
+  // header data
+  private int blockSizeInValues;
+  private int miniBlockNumInABlock;
+  private int totalValueCount;
+  private long firstValue;
+
+  private int miniBlockSizeInValues;
+
+  // values read by the caller
+  private int valuesRead = 0;
+
+  // variables to keep state of the current block and miniblock
+  private long lastValueRead;  // needed to compute the next value
+  private long minDeltaInCurrentBlock; // needed to compute the next value
+  // currentMiniBlock keeps track of the mini block within the current block that
+  // we read and decoded most recently. Only used as an index into
+  // bitWidths array
+  private int currentMiniBlock = 0;
+  private int[] bitWidths; // bit widths for each miniBlock in the current block
+  private int remainingInBlock = 0; // values in current block still to be read
+  private int remainingInMiniBlock = 0; // values in current mini block still to be read
+  private long[] unpackedValuesBuffer;
+
+  private ByteBufferInputStream in;
+
+  // temporary buffers used by readByte, readShort, readInteger, and readLong
+  byte byteVal;

Review comment:
   Should these 4 fields be private?
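
   Since the diff is truncated here, a simplified sketch of the decoding rule the Javadoc above describes may be useful. It assumes the per-miniblock deltas have already been bit-unpacked into an array (the real reader does that with Parquet's bit-packing classes), so it only illustrates the arithmetic, not the PR's implementation:

   ```scala
   // Simplified illustration of DELTA_BINARY_PACKED decoding: each block stores a
   // minimum delta, and the bit-packed values are the per-element deltas minus that
   // minimum, so value(i) = value(i - 1) + minDelta + packedDelta(i).
   def decodeBlock(firstValue: Long, minDelta: Long, packedDeltas: Array[Long]): Array[Long] = {
     val out = new Array[Long](packedDeltas.length + 1)
     out(0) = firstValue
     var i = 0
     while (i < packedDeltas.length) {
       out(i + 1) = out(i) + minDelta + packedDeltas(i)
       i += 1
     }
     out
   }

   // Example: values 7, 5, 3, 1, 2, 3 have deltas -2, -2, -2, 1, 1; minDelta = -2,
   // so the stored (bit-packed) deltas are 0, 0, 0, 3, 3.
   // decodeBlock(7L, -2L, Array(0L, 0L, 0L, 3L, 3L)) == Array(7L, 5L, 3L, 1L, 2L, 3L)
   ```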
   
   







[GitHub] [spark] LuciferYang commented on a change in pull request #34471: [SPARK-36879][SQL] Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path

2022-01-06 Thread GitBox


LuciferYang commented on a change in pull request #34471:
URL: https://github.com/apache/spark/pull/34471#discussion_r779406568



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
##
@@ -76,6 +79,7 @@ object DataSourceReadBenchmark extends SqlBasedBenchmark {
 saveAsCsvTable(testDf, dir.getCanonicalPath + "/csv")
 saveAsJsonTable(testDf, dir.getCanonicalPath + "/json")
 saveAsParquetTable(testDf, dir.getCanonicalPath + "/parquet")
+saveAsParquetV2Table(testDf, dir.getCanonicalPath + "/parquetV2")

Review comment:
   Maybe we should update the benchmark results of `DataSourceReadBenchmark` @parthchandra
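
   For reference, benchmark result files in Spark are typically regenerated with the command documented in the benchmark file's own header, roughly `SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark"`, which rewrites `DataSourceReadBenchmark-results.txt` under the module's `benchmarks/` directory (the exact invocation may differ by Spark version).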



