[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692968#comment-17692968 ]
ASF GitHub Bot commented on PARQUET-2159: ----------------------------------------- wgtmac commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1116420531 ########## pom.xml: ########## @@ -151,6 +151,9 @@ <module>parquet-scala</module> <module>parquet-thrift</module> <module>parquet-hadoop-bundle</module> + <!-- + <module>plugins/parquet-encoding-vector</module> Review Comment: This means user has to remove the comment to enable it which seems not very friendly. We can use `<profile>` to enable module optionally. ########## parquet-encoding/src/main/java/org/apache/parquet/column/values/bitpacking/Packer.java: ########## @@ -86,6 +105,7 @@ private static Object getStaticField(String className, String fieldName) { static IntPackerFactory leIntPackerFactory = getIntPackerFactory("LemireBitPackingLE"); static BytePackerFactory beBytePackerFactory = getBytePackerFactory("ByteBitPackingBE"); static BytePackerFactory leBytePackerFactory = getBytePackerFactory("ByteBitPackingLE"); + static BytePackerFactory leBytePacker512VectorFactory = null; Review Comment: Please add a comment to explain leBytePacker512VectorFactory will be initialized lazily. ########## plugins/parquet-encoding-vector/pom.xml: ########## @@ -0,0 +1,131 @@ +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> +<project xmlns="http://maven.apache.org/POM/4.0.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <parent> + <groupId>org.apache.parquet</groupId> + <artifactId>parquet</artifactId> + <version>1.13.0-SNAPSHOT</version> + <relativePath>../../pom.xml</relativePath> + </parent> + + <modelVersion>4.0.0</modelVersion> + + <artifactId>parquet-encoding-vector</artifactId> + <packaging>jar</packaging> + + <name>Apache Parquet Encodings Vector</name> + <url>https://parquet.apache.org</url> + + <properties> + <maven.compiler.source>17</maven.compiler.source> Review Comment: If we add a new profile in the root pom file, then we can set compiler version there and avoid hardcode here. ########## README.md: ########## @@ -83,6 +83,16 @@ Parquet is a very active project, and new features are being added quickly. Here * Column stats * Delta encoding * Index pages +* Java Vector API support + +## Java Vector API support +Parquet-MR has supported Java Vector API to speed up reading, to enable the function: +* Java 17+, 64-bit +* For Intel CPU, Flags containing avx512vbmi and avx512_vbmi2 can have better performance gains(ICE Lake or newer processor). +* mvn clean package -P java17-target -P vector Review Comment: Please update the document to reflect the latest change. ########## parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/ByteBitPackingVectorBenchmarks.java: ########## @@ -0,0 +1,92 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.benchmarks; + +import org.apache.parquet.column.values.bitpacking.BytePacker; +import org.apache.parquet.column.values.bitpacking.Packer; +import org.openjdk.jmh.annotations.Benchmark; +import org.openjdk.jmh.annotations.BenchmarkMode; +import org.openjdk.jmh.annotations.Level; +import org.openjdk.jmh.annotations.Measurement; +import org.openjdk.jmh.annotations.Mode; +import org.openjdk.jmh.annotations.OutputTimeUnit; +import org.openjdk.jmh.annotations.Scope; +import org.openjdk.jmh.annotations.Setup; +import org.openjdk.jmh.annotations.State; +import org.openjdk.jmh.annotations.Warmup; + +import java.util.concurrent.TimeUnit; + +/** + * This class uses the java17 vector API, add VM options --add-modules=jdk.incubator.vector + */ + +@State(Scope.Benchmark) +@BenchmarkMode(Mode.AverageTime) +@Warmup(iterations = 1, batchSize = 100000) +@Measurement(iterations = 1, batchSize = 100000) +@OutputTimeUnit(TimeUnit.MILLISECONDS) +public class ByteBitPackingVectorBenchmarks { Review Comment: If the parquet-encoding-vector does not build, what will happen to this file? Does it fail? ########## plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPacking512VectorLE.java: ########## @@ -0,0 +1,3010 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.parquet.column.values.bitpacking; + +import jdk.incubator.vector.ByteVector; +import jdk.incubator.vector.IntVector; +import jdk.incubator.vector.LongVector; +import jdk.incubator.vector.ShortVector; +import jdk.incubator.vector.Vector; +import jdk.incubator.vector.VectorMask; +import jdk.incubator.vector.VectorOperators; +import jdk.incubator.vector.VectorShuffle; +import jdk.incubator.vector.VectorSpecies; + +import java.nio.ByteBuffer; + +/** + * Review Comment: Remove it or add something? ########## plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/VectorSupport.java: ########## @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.column.values.bitpacking; + +/** + * Vector bits which has be supported Review Comment: ```suggestion * Supported bit widths to use Vector API. ``` > Parquet bit-packing de/encode optimization > ------------------------------------------ > > Key: PARQUET-2159 > URL: https://issues.apache.org/jira/browse/PARQUET-2159 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr > Affects Versions: 1.13.0 > Reporter: Fang-Xie > Assignee: Fang-Xie > Priority: Major > Fix For: 1.13.0 > > Attachments: image-2022-06-15-22-56-08-396.png, > image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, > image-2022-06-15-22-58-40-704.png > > > Current Spark use Parquet-mr as parquet reader/writer library, but the > built-in bit-packing en/decode is not efficient enough. > Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector > in Open JDK18 brings prominent performance improvement. > Due to Vector API is added to OpenJDK since 16, So this optimization request > JDK16 or higher. > *Below are our test results* > Functional test is based on open-source parquet-mr Bit-pack decoding > function: *_public final void unpack8Values(final byte[] in, final int inPos, > final int[] out, final int outPos)_* __ > compared with our implementation with vector API *_public final void > unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final > int outPos)_* > We tested 10 pairs (open source parquet bit unpacking vs ours optimized > vectorized SIMD implementation) decode function with bit > width=\{1,2,3,4,5,6,7,8,9,10}, below are test results: > !image-2022-06-15-22-56-08-396.png|width=437,height=223! > We integrated our bit-packing decode implementation into parquet-mr, tested > the parquet batch reader ability from Spark VectorizedParquetRecordReader > which get parquet column data by the batch way. We construct parquet file > with different row count and column count, the column data type is Int32, the > maximum int value is 127 which satisfies bit pack encode with bit width=7, > the count of the row is from 10k to 100 million and the count of the column > is from 1 to 4. > !image-2022-06-15-22-57-15-964.png|width=453,height=229! > !image-2022-06-15-22-58-01-442.png|width=439,height=217! > !image-2022-06-15-22-58-40-704.png|width=415,height=208! -- This message was sent by Atlassian Jira (v8.20.10#820010)