[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-28 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1121218282


##
.github/workflows/vector-plugins.yml:
##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java: [ '17' ]
+codes: [ 'uncompressed,brotli', 'gzip,snappy' ]

Review Comment:
   To reduce some resource consumption, `uncompressed` seems enough.



##
.github/workflows/vector-plugins.yml:
##
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Vector-plugins
+
+on: [push, pull_request]
+
+jobs:
+  build:
+
+runs-on: ubuntu-latest
+strategy:
+  fail-fast: false
+  matrix:
+java: [ '17' ]
+codes: [ 'uncompressed,brotli', 'gzip,snappy' ]
+name: Build Parquet with JDK ${{ matrix.java }} and ${{ matrix.codes }}
+
+steps:
+  - uses: actions/checkout@master
+  - name: Set up JDK ${{ matrix.java }}
+uses: actions/setup-java@v1
+with:
+  java-version: ${{ matrix.java }}
+  - name: before_install
+env:
+  CI_TARGET_BRANCH: $GITHUB_HEAD_REF
+run: |
+  bash dev/ci-before_install.sh
+  - name: install
+run: |
+  EXTRA_JAVA_TEST_ARGS=$(mvn help:evaluate 
-Dexpression=extraJavaTestArgs -q -DforceStdout)
+  export MAVEN_OPTS="$MAVEN_OPTS $EXTRA_JAVA_TEST_ARGS"
+  mvn install --batch-mode -Pvector-plugins -DskipTests=true 
-Dmaven.javadoc.skip=true -Dsource.skip=true -Djava.version=${{ matrix.java }} 
-pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift

Review Comment:
   Why is `-pl 
-parquet-hadoop,-parquet-arrow,-parquet-avro,-parquet-benchmarks,-parquet-cli,-parquet-column,-parquet-hadoop-bundle,-parquet-jackson,-parquet-pig,-parquet-pig-bundle,-parquet-protobuf,-parquet-thrift`
 required here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-27 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1119478321


##
pom.xml:
##
@@ -659,5 +662,13 @@
 
   
 
+
+
+  plugins

Review Comment:
   `plugins` is a little bit generic to me. Rename it to `encoding-plugin` or 
`vector-plugins`? Any suggestion?  @gszadovszky 
   
   In addition, please rename `plugins` folder to `parquet-plugins` to follow 
naming of other sub-directories.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-26 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1118319141


##
parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding.vectorbitpacking;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * This class generates vector bit packers that pack the most significant bit 
first.
+ * The result of the generation is checked in. To regenerate the code run this 
class and check in the result.
+ */
+public class BitPackingGenerator512Vector {
+  private static final String CLASS_NAME_PREFIX_FOR_INT = 
"ByteBitPacking512Vector";
+  private static final String CLASS_NAME_PREFIX_FOR_LONG = 
"ByteBitPacking512VectorForLong";
+
+  public static void main(String[] args) throws Exception {
+String basePath = args[0];
+//TODO: Int for Big Endian
+//generateScheme(false, true, basePath);
+
+// Int for Little Endian
+generateScheme(false, false, basePath);
+
+//TODO: Long for Big Endian
+//generateScheme(true, true, basePath);
+
+//TODO: Long for Little Endian
+//generateScheme(true, false, basePath);
+  }
+
+  private static void generateScheme(boolean isLong, boolean msbFirst,

Review Comment:
   Not sure whether this answer works in this case: 
https://stackoverflow.com/a/13383092



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-23 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1116427502


##
parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding.vectorbitpacking;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * This class generates vector bit packers that pack the most significant bit 
first.
+ * The result of the generation is checked in. To regenerate the code run this 
class and check in the result.
+ */
+public class BitPackingGenerator512Vector {
+  private static final String CLASS_NAME_PREFIX_FOR_INT = 
"ByteBitPacking512Vector";
+  private static final String CLASS_NAME_PREFIX_FOR_LONG = 
"ByteBitPacking512VectorForLong";
+
+  public static void main(String[] args) throws Exception {
+String basePath = args[0];
+//TODO: Int for Big Endian
+//generateScheme(false, true, basePath);
+
+// Int for Little Endian
+generateScheme(false, false, basePath);
+
+//TODO: Long for Big Endian
+//generateScheme(true, true, basePath);
+
+//TODO: Long for Little Endian
+//generateScheme(true, false, basePath);
+  }
+
+  private static void generateScheme(boolean isLong, boolean msbFirst,

Review Comment:
   +1 for the option 2. But why not enable it by a new profile?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-23 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1116420531


##
pom.xml:
##
@@ -151,6 +151,9 @@
 parquet-scala
 parquet-thrift
 parquet-hadoop-bundle
+
+http://maven.apache.org/POM/4.0.0;
+ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
http://maven.apache.org/xsd/maven-4.0.0.xsd;>
+  
+org.apache.parquet
+parquet
+1.13.0-SNAPSHOT
+../../pom.xml
+  
+
+  4.0.0
+
+  parquet-encoding-vector
+  jar
+
+  Apache Parquet Encodings Vector
+  https://parquet.apache.org
+
+  
+17

Review Comment:
   If we add a new profile in the root pom file, then we can set compiler 
version there and avoid hardcode here.



##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support
+Parquet-MR has supported Java Vector API to speed up reading, to enable the 
function:
+* Java 17+, 64-bit
+* For Intel CPU, Flags containing avx512vbmi and avx512_vbmi2 can have better 
performance gains(ICE Lake or newer processor).
+* mvn clean package -P java17-target -P vector

Review Comment:
   Please update the document to reflect the latest change.



##
parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/ByteBitPackingVectorBenchmarks.java:
##
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.benchmarks;
+
+import org.apache.parquet.column.values.bitpacking.BytePacker;
+import org.apache.parquet.column.values.bitpacking.Packer;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class uses the java17 vector API, add VM options 
--add-modules=jdk.incubator.vector
+ */
+
+@State(Scope.Benchmark)
+@BenchmarkMode(Mode.AverageTime)
+@Warmup(iterations = 1, batchSize = 10)
+@Measurement(iterations = 1, batchSize = 10)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+public class ByteBitPackingVectorBenchmarks {

Review Comment:
   If the parquet-encoding-vector does not build, what will happen to this 
file? Does it fail?



##
plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPacking512VectorLE.java:
##
@@ -0,0 +1,3010 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.ByteVector;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.ShortVector;
+import jdk.incubator.vector.Vector;
+import jdk.incubator.vector.VectorMask;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorShuffle;
+import jdk.incubator.vector.VectorSpecies;
+
+import java.nio.ByteBuffer;
+
+/**
+ *

Review Comment:
   Remove it or add 

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-22 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1114220877


##
parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding.vectorbitpacking;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * This class generates vector bit packers that pack the most significant bit 
first.
+ * The result of the generation is checked in. To regenerate the code run this 
class and check in the result.
+ */
+public class BitPackingGenerator512Vector {
+  private static final String CLASS_NAME_PREFIX_FOR_INT = 
"ByteBitPacking512Vector";
+  private static final String CLASS_NAME_PREFIX_FOR_LONG = 
"ByteBitPacking512VectorForLong";
+
+  public static void main(String[] args) throws Exception {
+String basePath = args[0];
+//TODO: Int for Big Endian
+//generateScheme(false, true, basePath);
+
+// Int for Little Endian
+generateScheme(false, false, basePath);
+
+//TODO: Long for Big Endian
+//generateScheme(true, true, basePath);
+
+//TODO: Long for Little Endian
+//generateScheme(true, false, basePath);
+  }
+
+  private static void generateScheme(boolean isLong, boolean msbFirst,

Review Comment:
   +1 for @gszadovszky 
   
   The core idea is to isolate the new feature as an individual unit and 
minimize the burden of future maintenance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-22 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1114064922


##
parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ParquetReadRouter.java:
##
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column.values.bitpacking;
+
+import org.apache.parquet.bytes.ByteBufferInputStream;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.EOFException;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Utility class for big data applications (such as Apache Spark and Apache 
Flink).
+ * For Intel CPU, Flags containing avx512vbmi and avx512_vbmi2 can have better 
performance gains.
+ */
+public class ParquetReadRouter {
+  private static final Logger LOG = 
LoggerFactory.getLogger(ParquetReadRouter.class);
+
+  private static final int BITS_PER_BYTE = 8;
+
+  // register of avx512 are 512 bits, and can load up to 64 bytes
+  private static final int BYTES_PER_VECTOR_512 = 64;
+
+  // values are bit packed 8 at a time, so reading bitWidth will always work
+  private static final int NUM_VALUES_TO_PACK = 8;
+
+  private static final VectorSupport vectorSupport;
+
+  static {
+vectorSupport = getSupportVectorFromCPUFlags();
+  }
+
+  // Dispatches to use vector when available. Directly call 
readBatchUsing512Vector() if you are sure about it.
+  public static void read(int bitWidth, ByteBufferInputStream in, int 
currentCount, int[] currentBuffer) throws IOException {
+switch (vectorSupport) {
+  case VECTOR_512:
+readBatchUsing512Vector(bitWidth, in, currentCount, currentBuffer);
+break;
+  default:
+readBatch(bitWidth, in, currentCount, currentBuffer);
+}
+  }
+
+  // Call the method directly if your computer system contains avx512vbmi and 
avx512_vbmi2 CPU Flags
+  public static void readBatchUsing512Vector(int bitWidth, 
ByteBufferInputStream in, int currentCount, int[] currentBuffer) throws 
IOException {
+BytePacker packer = Packer.LITTLE_ENDIAN.newBytePacker(bitWidth);
+BytePacker packerVector = 
Packer.LITTLE_ENDIAN.newBytePackerVector(bitWidth);
+int valueIndex = 0;
+int byteIndex = 0;
+int unpackCount = packerVector.getUnpackCount();
+int inputByteCountPerVector = packerVector.getUnpackCount() / 
BITS_PER_BYTE * bitWidth;
+int totalByteCount = currentCount * bitWidth / BITS_PER_BYTE;
+int totalByteCountVector = totalByteCount - BYTES_PER_VECTOR_512;
+ByteBuffer buffer = in.slice(totalByteCount);
+if (buffer.hasArray()) {
+  for (; byteIndex < totalByteCountVector; byteIndex += 
inputByteCountPerVector, valueIndex += unpackCount) {
+packerVector.unpackValuesUsingVector(buffer.array(), 
buffer.arrayOffset() + buffer.position() + byteIndex, currentBuffer, 
valueIndex);
+  }
+  // If the remaining bytes size <= {BYTES_PER_512VECTOR}, the remaining 
bytes are unpacked by packer
+  for (; byteIndex < totalByteCount; byteIndex += bitWidth, valueIndex += 
NUM_VALUES_TO_PACK) {
+packer.unpack8Values(buffer.array(), buffer.arrayOffset() + 
buffer.position() + byteIndex, currentBuffer, valueIndex);
+  }
+} else {
+  for (; byteIndex < totalByteCountVector; byteIndex += 
inputByteCountPerVector, valueIndex += unpackCount) {
+packerVector.unpackValuesUsingVector(buffer, buffer.position() + 
byteIndex, currentBuffer, valueIndex);
+  }
+  for (; byteIndex < totalByteCount; byteIndex += bitWidth, valueIndex += 
NUM_VALUES_TO_PACK) {
+packer.unpack8Values(buffer, buffer.position() + byteIndex, 
currentBuffer, valueIndex);
+  }
+}
+  }
+
+  // Call the method directly if your computer system doesn't contain 
avx512vbmi and avx512_vbmi2 CPU Flags
+  public static void readBatch(int bitWidth, ByteBufferInputStream in, int 
currentCount, int[] currentBuffer) throws EOFException {
+BytePacker packer = 

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-22 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1114063818


##
parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java:
##
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding.vectorbitpacking;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * This class generates vector bit packers that pack the most significant bit 
first.
+ * The result of the generation is checked in. To regenerate the code run this 
class and check in the result.
+ */
+public class BitPackingGenerator512Vector {
+  private static final String CLASS_NAME_PREFIX_FOR_INT = 
"ByteBitPacking512Vector";
+  private static final String CLASS_NAME_PREFIX_FOR_LONG = 
"ByteBitPacking512VectorForLong";
+
+  public static void main(String[] args) throws Exception {
+String basePath = args[0];
+//TODO: Int for Big Endian
+//generateScheme(false, true, basePath);
+
+// Int for Little Endian
+generateScheme(false, false, basePath);
+
+//TODO: Long for Big Endian
+//generateScheme(true, true, basePath);
+
+//TODO: Long for Little Endian
+//generateScheme(true, false, basePath);
+  }
+
+  private static void generateScheme(boolean isLong, boolean msbFirst,

Review Comment:
   > So by generating this class simply copies the file form `resources` to a 
source dir under the name of `.java`. Why do we need code generation then?
   
   I guess the reason is that it does not compile without java17.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-06 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1097538545


##
README.md:
##
@@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being 
added quickly. Here
 * Column stats
 * Delta encoding
 * Index pages
+* Java Vector API support
+
+## Java Vector API support

Review Comment:
   @dongjoon-hyun @sunchao Could you please take a look and advise if it is 
possible to integrate it with Apache Spark? Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-01 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1094047290


##
parquet-generator/src/main/resources/ByteBitPacking512VectorLE:
##
@@ -0,0 +1,3095 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.ByteVector;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.ShortVector;
+import jdk.incubator.vector.Vector;
+import jdk.incubator.vector.VectorMask;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorShuffle;
+import jdk.incubator.vector.VectorSpecies;
+
+import java.nio.ByteBuffer;
+
+/**
+ * This is an auto-generated source file and should not edit it directly.
+ */
+public abstract class ByteBitPacking512VectorLE {

Review Comment:
   In this case, the script is not necessary. Manual bit-unpacking code is 
error-prone, we really rely on the quality and coverage of test cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-01-31 Thread via GitHub


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1092845969


##
parquet-generator/src/main/resources/ByteBitPacking512VectorLE:
##
@@ -0,0 +1,3095 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;
+
+import jdk.incubator.vector.ByteVector;
+import jdk.incubator.vector.IntVector;
+import jdk.incubator.vector.LongVector;
+import jdk.incubator.vector.ShortVector;
+import jdk.incubator.vector.Vector;
+import jdk.incubator.vector.VectorMask;
+import jdk.incubator.vector.VectorOperators;
+import jdk.incubator.vector.VectorShuffle;
+import jdk.incubator.vector.VectorSpecies;
+
+import java.nio.ByteBuffer;
+
+/**
+ * This is an auto-generated source file and should not edit it directly.
+ */
+public abstract class ByteBitPacking512VectorLE {

Review Comment:
   Do you have any script to generate the code here? If true, it would be great 
to commit it as well.



##
parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/ByteBitPackingVectorBenchmarks.java:
##
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.benchmarks;
+
+import org.apache.parquet.column.values.bitpacking.BytePacker;
+import org.apache.parquet.column.values.bitpacking.Packer;
+import org.openjdk.jmh.annotations.*;
+
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class uses the java17 vector API, add VM options 
--add-modules=jdk.incubator.vector
+ */
+
+@State(Scope.Benchmark)
+@BenchmarkMode(Mode.AverageTime)
+@Warmup(iterations = 1, batchSize = 10)
+@Measurement(iterations = 1, batchSize = 10)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+public class ByteBitPackingVectorBenchmarks {
+
+  /**
+   * The range of bitWidth is 1 ~ 32, change it directly if test other 
bitWidth.
+   */
+  private static final int bitWidth = 7;
+  private static final int outputValues = 1024;
+  private final byte[] input = new byte[outputValues * bitWidth / 8];
+  private final int[] output = new int[outputValues];
+  private final int[] outputVector = new int[outputValues];
+
+  @Setup(Level.Trial)
+  public void getInputBytes() {
+for (int i = 0; i < input.length; i++) {
+  input[i] = (byte) i;
+}
+  }
+
+  @Benchmark
+  public void testUnpack() {
+BytePacker bytePacker = Packer.LITTLE_ENDIAN.newBytePacker(bitWidth);
+for (int i = 0, j = 0; i < input.length; i += bitWidth, j += 8) {
+  bytePacker.unpack8Values(input, i, output, j);
+}
+  }
+
+  @Benchmark
+  public void testUnpackVector() {
+BytePacker bytePacker = Packer.LITTLE_ENDIAN.newBytePacker(bitWidth);
+BytePacker bytePackerVector = 
Packer.LITTLE_ENDIAN.newBytePackerVector(bitWidth);

Review Comment:
   Could you elaborate more? @jatin-bhateja 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2022-12-06 Thread GitBox


wgtmac commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1040642048


##
parquet-generator/src/main/resources/ByteBitPackingVectorLE:
##
@@ -0,0 +1,3215 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.column.values.bitpacking;

Review Comment:
   Should we add a comment here to let reader know this is an auto-generated 
source file and should not edit it directly? Just in case.



##
parquet-generator/src/main/java/org/apache/parquet/encoding/bitpacking/ByteBasedBitPackingVectorGenerator.java:
##
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding.bitpacking;
+
+import java.io.*;
+
+/**
+ * This class generates vector bit packers that pack the most significant bit 
first.
+ * The result of the generation is checked in. To regenerate the code run this 
class and check in the result.

Review Comment:
   Why is this intentional?



##
parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ParquetReadRouter.java:
##
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column.values.bitpacking;
+
+import org.apache.parquet.bytes.ByteBufferInputStream;
+
+import java.io.EOFException;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+public class ParquetReadRouter {

Review Comment:
   It seems that this class is not used anywhere.



##
parquet-generator/src/main/java/org/apache/parquet/encoding/GeneratorVector.java:
##
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding;
+
+import 
org.apache.parquet.encoding.bitpacking.ByteBasedBitPackingVectorGenerator;
+
+/**
+ * main class for code generation hook in build for encodings vector generation
+ */