[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

ASF GitHub Bot (Jira) Wed, 22 Feb 2023 03:11:16 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692131#comment-17692131
 ]


ASF GitHub Bot commented on PARQUET-2159:
-----------------------------------------

gszadovszky commented on code in PR #1011:
URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1114179342


##########
parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java:
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.encoding.vectorbitpacking;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * This class generates vector bit packers that pack the most significant bit 
first.
+ * The result of the generation is checked in. To regenerate the code run this 
class and check in the result.
+ */
+public class BitPackingGenerator512Vector {
+  private static final String CLASS_NAME_PREFIX_FOR_INT = 
"ByteBitPacking512Vector";
+  private static final String CLASS_NAME_PREFIX_FOR_LONG = 
"ByteBitPacking512VectorForLong";
+
+  public static void main(String[] args) throws Exception {
+    String basePath = args[0];
+    //TODO: Int for Big Endian
+    //generateScheme(false, true, basePath);
+
+    // Int for Little Endian
+    generateScheme(false, false, basePath);
+
+    //TODO: Long for Big Endian
+    //generateScheme(true, true, basePath);
+
+    //TODO: Long for Little Endian
+    //generateScheme(true, false, basePath);
+  }
+
+  private static void generateScheme(boolean isLong, boolean msbFirst,

Review Comment:
   I would suggest to have a separate source directory for java17 that gets 
"activated" only in case of the related profile is activated. This generator 
solution is misleading to me.
   
   To keep this part of code clean I would also suggest to include java17 and 
the related class compile/unit test execution in the github actions so they 
would be executed on the PRs. (The only difference remains is we won't ship 
these in our releases.)
   
   What do you think, @wgtmac, @jiangjiguang?





> Parquet bit-packing de/encode optimization
> ------------------------------------------
>
>                 Key: PARQUET-2159
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2159
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.13.0
>            Reporter: Fang-Xie
>            Assignee: Fang-Xie
>            Priority: Major
>             Fix For: 1.13.0
>
>         Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Current Spark use Parquet-mr as parquet reader/writer library, but the 
> built-in bit-packing en/decode is not efficient enough. 
> Our optimization for Parquet bit-packing en/decode with jdk.incubator.vector 
> in Open JDK18 brings prominent performance improvement.
> Due to Vector API is added to OpenJDK since 16, So this optimization request 
> JDK16 or higher.
> *Below are our test results*
> Functional test is based on open-source parquet-mr Bit-pack decoding 
> function: *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_* __
> compared with our implementation with vector API *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*
> We tested 10 pairs (open source parquet bit unpacking vs ours optimized 
> vectorized SIMD implementation) decode function with bit 
> width=\{1,2,3,4,5,6,7,8,9,10}, below are test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr, tested 
> the parquet batch reader ability from Spark VectorizedParquetRecordReader 
> which get parquet column data by the batch way. We construct parquet file 
> with different row count and column count, the column data type is Int32, the 
> maximum int value is 127 which satisfies bit pack encode with bit width=7,   
> the count of the row is from 10k to 100 million and the count of the column 
> is from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

Reply via email to