[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610358#comment-17610358
 ] 

ASF GitHub Bot commented on PARQUET-1711:
-

jinyius commented on PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#issuecomment-1260420022

   > @matthieun and @jinyius Would it be possible for you both to sync to come 
up with one solution? You can put the other one as co-author.
   
   IMHO, #995 is a superset of this PR's functionality.




> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well known json types" such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around json schema conversion. 
> However, writing the above messages traps the parquet writer in an infinite loop due 
> to the "general type" support in protobuf. The current implementation keeps 
> referencing the 6 possible value types defined in protobuf (null, bool, number, string, 
> struct, list) and enters an infinite loop when it reaches "struct".
> {code:java}
> java.lang.StackOverflowError
>  at java.base/java.util.Arrays$ArrayItr.<init>(Arrays.java:4418)
>  at java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410)
>  at java.base/java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1044)
>  at java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}
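The cycle behind the overflow can be sketched in a few lines. This is a toy model, not the parquet-mr sources: Struct contains Values, a Value may be a Struct or a ListValue, and a ListValue contains Values, so a converter that eagerly follows every branch never terminates unless it bounds recursion depth.

```java
import java.util.Arrays;
import java.util.List;

public class RecursiveSchemaDemo {
  // A schema node is just a type name plus the type names of its fields.
  static List<String> fieldsOf(String type) {
    switch (type) {
      case "Struct":    return Arrays.asList("Value");
      case "Value":     return Arrays.asList("NullValue", "boolean", "double",
                                             "String", "Struct", "ListValue");
      case "ListValue": return Arrays.asList("Value");
      default:          return Arrays.asList(); // scalar leaf, no fields
    }
  }

  // Count schema nodes visited, refusing to recurse past maxDepth.
  // Returns -1 when the limit is hit, signalling a type cycle.
  static int convert(String type, int depth, int maxDepth) {
    if (depth > maxDepth) return -1;
    int visited = 1;
    for (String child : fieldsOf(type)) {
      int sub = convert(child, depth + 1, maxDepth);
      if (sub < 0) return -1;
      visited += sub;
    }
    return visited;
  }

  public static void main(String[] args) {
    System.out.println(convert("String", 0, 10)); // 1: scalars convert fine
    System.out.println(convert("Struct", 0, 50)); // -1: cycle detected
  }
}
```

Without the `maxDepth` guard, `convert("Struct", ...)` is exactly the ProtoSchemaConverter recursion in the trace above.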



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] jinyius commented on pull request #988: PARQUET-1711: Break circular dependencies in proto definitions

2022-09-27 Thread GitBox


jinyius commented on PR #988:
URL: https://github.com/apache/parquet-mr/pull/988#issuecomment-1260420022

   > @matthieun and @jinyius Would it be possible for you both to sync to come 
up with one solution? You can put the other one as co-author.
   
   IMHO, #995 is a superset of this PR's functionality.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2184) Improve SnappyCompressor buffer expansion performance

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610269#comment-17610269
 ] 

ASF GitHub Bot commented on PARQUET-2184:
-

shangxinli commented on PR #993:
URL: https://github.com/apache/parquet-mr/pull/993#issuecomment-1260143659

   I wonder how much benefit we can gain from this fix? 




> Improve SnappyCompressor buffer expansion performance
> -
>
> Key: PARQUET-2184
> URL: https://issues.apache.org/jira/browse/PARQUET-2184
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Andrew Baranec
>Priority: Minor
>
> The existing implementation of SnappyCompressor only allocates enough 
> bytes for the buffer passed into setInput(). This leads to suboptimal 
> performance when there are patterns of writes that cause repeated buffer 
> expansions. In the worst case it must copy the entire buffer for every 
> single invocation of setInput().
> Instead of allocating a buffer of size current + write length, there should 
> be an expansion strategy that reduces the amount of copying required.
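The expansion strategy the ticket asks for can be sketched as a double-then-linear growth policy; the 8 MB threshold and 1 MB step below mirror the constants visible in the PR's diff quoted later in this digest, but the helper itself is an illustration, not the PR's code. Growing geometrically makes n appends cost O(n) copied bytes overall, instead of one full copy per setInput() call.

```java
public class GrowthDemo {
  static final int DOUBLING_THRESHOLD = 8 << 20; // double up to 8 MiB...
  static final int LINEAR_STEP = 1 << 20;        // ...then grow 1 MiB at a time

  // Smallest capacity >= required under the double-then-linear policy.
  static int grow(int current, int required) {
    int cap = Math.max(current, 1);
    while (cap < required) {
      cap = (cap < DOUBLING_THRESHOLD) ? cap * 2 : cap + LINEAR_STEP;
    }
    return cap;
  }

  public static void main(String[] args) {
    System.out.println(grow(1024, 1500));             // 2048: doubled once
    System.out.println(grow(8 << 20, (8 << 20) + 1)); // 9437184: 8 MiB + 1 MiB
  }
}
```

The linear tail caps memory overshoot for very large buffers, trading a little extra copying for bounded waste.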





[GitHub] [parquet-mr] shangxinli commented on pull request #993: PARQUET-2184: Improve the allocation behavior of SnappyCompressor

2022-09-27 Thread GitBox


shangxinli commented on PR #993:
URL: https://github.com/apache/parquet-mr/pull/993#issuecomment-1260143659

   I wonder how much benefit we can gain from this fix? 





[jira] [Commented] (PARQUET-2184) Improve SnappyCompressor buffer expansion performance

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610265#comment-17610265
 ] 

ASF GitHub Bot commented on PARQUET-2184:
-

shangxinli commented on code in PR #993:
URL: https://github.com/apache/parquet-mr/pull/993#discussion_r981781086


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyCompressor.java:
##
@@ -96,21 +100,40 @@ public synchronized void setInput(byte[] buffer, int off, int len) {
       "Output buffer should be empty. Caller must call compress()");
 
     if (inputBuffer.capacity() - inputBuffer.position() < len) {
-      ByteBuffer tmp = ByteBuffer.allocateDirect(inputBuffer.position() + len);
-      inputBuffer.rewind();
-      tmp.put(inputBuffer);
-      ByteBuffer oldBuffer = inputBuffer;
-      inputBuffer = tmp;
-      CleanUtil.cleanDirectBuffer(oldBuffer);
-    } else {
-      inputBuffer.limit(inputBuffer.position() + len);
+      resizeInputBuffer(inputBuffer.position() + len);
     }
 
+    inputBuffer.limit(inputBuffer.position() + len);

Review Comment:
   The original code doesn't call limit() if (inputBuffer.capacity() - inputBuffer.position() < len) is true





> Improve SnappyCompressor buffer expansion performance
> -
>
> Key: PARQUET-2184
> URL: https://issues.apache.org/jira/browse/PARQUET-2184
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Andrew Baranec
>Priority: Minor
>
> The existing implementation of SnappyCompressor only allocates enough 
> bytes for the buffer passed into setInput(). This leads to suboptimal 
> performance when there are patterns of writes that cause repeated buffer 
> expansions. In the worst case it must copy the entire buffer for every 
> single invocation of setInput().
> Instead of allocating a buffer of size current + write length, there should 
> be an expansion strategy that reduces the amount of copying required.





[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #993: PARQUET-2184: Improve the allocation behavior of SnappyCompressor

2022-09-27 Thread GitBox


shangxinli commented on code in PR #993:
URL: https://github.com/apache/parquet-mr/pull/993#discussion_r981781086


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyCompressor.java:
##
@@ -96,21 +100,40 @@ public synchronized void setInput(byte[] buffer, int off, int len) {
       "Output buffer should be empty. Caller must call compress()");
 
     if (inputBuffer.capacity() - inputBuffer.position() < len) {
-      ByteBuffer tmp = ByteBuffer.allocateDirect(inputBuffer.position() + len);
-      inputBuffer.rewind();
-      tmp.put(inputBuffer);
-      ByteBuffer oldBuffer = inputBuffer;
-      inputBuffer = tmp;
-      CleanUtil.cleanDirectBuffer(oldBuffer);
-    } else {
-      inputBuffer.limit(inputBuffer.position() + len);
+      resizeInputBuffer(inputBuffer.position() + len);
     }
 
+    inputBuffer.limit(inputBuffer.position() + len);

Review Comment:
   The original code doesn't call limit() if (inputBuffer.capacity() - inputBuffer.position() < len) is true






[jira] [Commented] (PARQUET-2184) Improve SnappyCompressor buffer expansion performance

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610260#comment-17610260
 ] 

ASF GitHub Bot commented on PARQUET-2184:
-

shangxinli commented on code in PR #993:
URL: https://github.com/apache/parquet-mr/pull/993#discussion_r981762161


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyCompressor.java:
##
@@ -32,6 +32,10 @@
  * entire input in setInput and compresses it as one compressed block.
  */
 public class SnappyCompressor implements Compressor {
+  // Double up to an 8 MB write buffer, then switch to 1 MB linear allocation
+  private static final int DOUBLING_ALLOC_THRESH = 8 << 20;

Review Comment:
   Using 1 << 23 would be more meaningful
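For reference, the two spellings are the same constant; the suggestion is purely about readability:

```java
public class ShiftDemo {
  // Both 8 << 20 and 1 << 23 evaluate to 8 MiB.
  static int eightMiB() { return 8 << 20; }

  public static void main(String[] args) {
    System.out.println(eightMiB() == (1 << 23)); // true
    System.out.println(eightMiB());              // 8388608
  }
}
```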





> Improve SnappyCompressor buffer expansion performance
> -
>
> Key: PARQUET-2184
> URL: https://issues.apache.org/jira/browse/PARQUET-2184
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Andrew Baranec
>Priority: Minor
>
> The existing implementation of SnappyCompressor only allocates enough 
> bytes for the buffer passed into setInput(). This leads to suboptimal 
> performance when there are patterns of writes that cause repeated buffer 
> expansions. In the worst case it must copy the entire buffer for every 
> single invocation of setInput().
> Instead of allocating a buffer of size current + write length, there should 
> be an expansion strategy that reduces the amount of copying required.





[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #993: PARQUET-2184: Improve the allocation behavior of SnappyCompressor

2022-09-27 Thread GitBox


shangxinli commented on code in PR #993:
URL: https://github.com/apache/parquet-mr/pull/993#discussion_r981762161


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyCompressor.java:
##
@@ -32,6 +32,10 @@
  * entire input in setInput and compresses it as one compressed block.
  */
 public class SnappyCompressor implements Compressor {
+  // Double up to an 8 MB write buffer, then switch to 1 MB linear allocation
+  private static final int DOUBLING_ALLOC_THRESH = 8 << 20;

Review Comment:
   Using 1 << 23 would be more meaningful






[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610211#comment-17610211
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

shangxinli commented on PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#issuecomment-1259976087

   Nice implementation! For the tests, can you add more covering interop with LZ4?




> Support LZ4_RAW codec
> -
>
> Key: PARQUET-2196
> URL: https://issues.apache.org/jira/browse/PARQUET-2196
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gang Wu
>Priority: Major
>
> There is a long history about the LZ4 interoperability of parquet files 
> between parquet-mr and parquet-cpp (which is now in Apache Arrow). 
> Attached links are the evidence. In short, a new LZ4_RAW codec type has been 
> introduced since parquet format v2.9.0. However, only parquet-cpp supports 
> LZ4_RAW. The parquet-mr library still uses the old Hadoop-provided LZ4 codec 
> and cannot read parquet files with LZ4_RAW.





[GitHub] [parquet-mr] shangxinli commented on pull request #1000: PARQUET-2196: Support LZ4_RAW codec

2022-09-27 Thread GitBox


shangxinli commented on PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#issuecomment-1259976087

   Nice implementation! For the tests, can you add more covering interop with LZ4?





[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1000: PARQUET-2196: Support LZ4_RAW codec

2022-09-27 Thread GitBox


shangxinli commented on code in PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#discussion_r981649538


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/NonBlockedDecompressor.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.codec;
+
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.parquet.Preconditions;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+abstract public class NonBlockedDecompressor implements Decompressor {
+
+  // Buffer for uncompressed output. This buffer grows as necessary.
+  private ByteBuffer outputBuffer = ByteBuffer.allocateDirect(0);
+
+  // Buffer for compressed input. This buffer grows as necessary.
+  private ByteBuffer inputBuffer = ByteBuffer.allocateDirect(0);
+
+  private boolean finished;
+
+  /**
+   * Fills specified buffer with uncompressed data. Returns actual number
+   * of bytes of uncompressed data. A return value of 0 indicates that
+   * {@link #needsInput()} should be called in order to determine if more
+   * input data is required.
+   *
+   * @param buffer Buffer for the compressed data
+   * @param off Start offset of the data
+   * @param len Size of the buffer
+   * @return The actual number of bytes of uncompressed data.
+   * @throws IOException if reading or decompression fails
+   */
+  @Override
+  public synchronized int decompress(byte[] buffer, int off, int len) throws 
IOException {
+SnappyUtil.validateBuffer(buffer, off, len);
+if (inputBuffer.position() == 0 && !outputBuffer.hasRemaining()) {
+  return 0;
+}
+
+if (!outputBuffer.hasRemaining()) {
+  inputBuffer.rewind();
+  Preconditions.checkArgument(inputBuffer.position() == 0, "Invalid 
position of 0.");
+  Preconditions.checkArgument(outputBuffer.position() == 0, "Invalid 
position of 0.");
+  // There is compressed input, decompress it now.
+  int decompressedSize = uncompressedLength(inputBuffer, len);
+  if (decompressedSize > outputBuffer.capacity()) {
+ByteBuffer oldBuffer = outputBuffer;
+outputBuffer = ByteBuffer.allocateDirect(decompressedSize);
+CleanUtil.cleanDirectBuffer(oldBuffer);
+  }
+
+  // Reset the previous outputBuffer (i.e. set position to 0)
+  outputBuffer.clear();
+  int size = uncompress(inputBuffer, outputBuffer);
+  outputBuffer.limit(size);
+  // We've decompressed the entire input, reset the input now
+  inputBuffer.clear();
+  inputBuffer.limit(0);
+  finished = true;
+}
+
+// Return compressed output up to 'len'
+int numBytes = Math.min(len, outputBuffer.remaining());
+outputBuffer.get(buffer, off, numBytes);
+return numBytes;
+  }
+
+  /**
+   * Sets input data for decompression.
+   * This should be called if and only if {@link #needsInput()} returns
+   * true indicating that more input data is required.
+   * (Both native and non-native versions of various Decompressors require
+   * that the data passed in via b[] remain unmodified until
+   * the caller is explicitly notified--via {@link #needsInput()}--that the
+   * buffer may be safely modified.  With this requirement, an extra
+   * buffer-copy can be avoided.)
+   *
+   * @param buffer Input data
+   * @param off Start offset
+   * @param len Length
+   */
+  @Override
+  public synchronized void setInput(byte[] buffer, int off, int len) {
+SnappyUtil.validateBuffer(buffer, off, len);
+
+if (inputBuffer.capacity() - inputBuffer.position() < len) {
+  final ByteBuffer newBuffer = 
ByteBuffer.allocateDirect(inputBuffer.position() + len);
+  inputBuffer.rewind();
+  newBuffer.put(inputBuffer);
+  final ByteBuffer oldBuffer = inputBuffer;
+  inputBuffer = newBuffer;
+  CleanUtil.cleanDirectBuffer(oldBuffer);
+} else {
+  inputBuffer.limit(inputBuffer.position() + len);
+}
+inputBuffer.put(buffer, off, len);
+  }
+
+  @Override
+  public void end() {
+CleanUtil.cleanDirectBuffer(inputBuffer);
+CleanUtil.cleanDirectBuffer(outputBuffer);
+  }
+
+  @Override
+  public synchronized boolean finished() {
+return 

[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610210#comment-17610210
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

shangxinli commented on code in PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#discussion_r981649538


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/NonBlockedDecompressor.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.codec;
+
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.parquet.Preconditions;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+abstract public class NonBlockedDecompressor implements Decompressor {
+
+  // Buffer for uncompressed output. This buffer grows as necessary.
+  private ByteBuffer outputBuffer = ByteBuffer.allocateDirect(0);
+
+  // Buffer for compressed input. This buffer grows as necessary.
+  private ByteBuffer inputBuffer = ByteBuffer.allocateDirect(0);
+
+  private boolean finished;
+
+  /**
+   * Fills specified buffer with uncompressed data. Returns actual number
+   * of bytes of uncompressed data. A return value of 0 indicates that
+   * {@link #needsInput()} should be called in order to determine if more
+   * input data is required.
+   *
+   * @param buffer Buffer for the compressed data
+   * @param off Start offset of the data
+   * @param len Size of the buffer
+   * @return The actual number of bytes of uncompressed data.
+   * @throws IOException if reading or decompression fails
+   */
+  @Override
+  public synchronized int decompress(byte[] buffer, int off, int len) throws 
IOException {
+SnappyUtil.validateBuffer(buffer, off, len);
+if (inputBuffer.position() == 0 && !outputBuffer.hasRemaining()) {
+  return 0;
+}
+
+if (!outputBuffer.hasRemaining()) {
+  inputBuffer.rewind();
+  Preconditions.checkArgument(inputBuffer.position() == 0, "Invalid 
position of 0.");
+  Preconditions.checkArgument(outputBuffer.position() == 0, "Invalid 
position of 0.");
+  // There is compressed input, decompress it now.
+  int decompressedSize = uncompressedLength(inputBuffer, len);
+  if (decompressedSize > outputBuffer.capacity()) {
+ByteBuffer oldBuffer = outputBuffer;
+outputBuffer = ByteBuffer.allocateDirect(decompressedSize);
+CleanUtil.cleanDirectBuffer(oldBuffer);
+  }
+
+  // Reset the previous outputBuffer (i.e. set position to 0)
+  outputBuffer.clear();
+  int size = uncompress(inputBuffer, outputBuffer);
+  outputBuffer.limit(size);
+  // We've decompressed the entire input, reset the input now
+  inputBuffer.clear();
+  inputBuffer.limit(0);
+  finished = true;
+}
+
+// Return compressed output up to 'len'
+int numBytes = Math.min(len, outputBuffer.remaining());
+outputBuffer.get(buffer, off, numBytes);
+return numBytes;
+  }
+
+  /**
+   * Sets input data for decompression.
+   * This should be called if and only if {@link #needsInput()} returns
+   * true indicating that more input data is required.
+   * (Both native and non-native versions of various Decompressors require
+   * that the data passed in via b[] remain unmodified until
+   * the caller is explicitly notified--via {@link #needsInput()}--that the
+   * buffer may be safely modified.  With this requirement, an extra
+   * buffer-copy can be avoided.)
+   *
+   * @param buffer Input data
+   * @param off Start offset
+   * @param len Length
+   */
+  @Override
+  public synchronized void setInput(byte[] buffer, int off, int len) {
+SnappyUtil.validateBuffer(buffer, off, len);
+
+if (inputBuffer.capacity() - inputBuffer.position() < len) {
+  final ByteBuffer newBuffer = 
ByteBuffer.allocateDirect(inputBuffer.position() + len);
+  inputBuffer.rewind();
+  newBuffer.put(inputBuffer);
+  final ByteBuffer oldBuffer = inputBuffer;
+  inputBuffer = newBuffer;
+  CleanUtil.cleanDirectBuffer(oldBuffer);
+} else {
+  inputBuffer.limit(inputBuffer.position() + len);
+}
+

[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610209#comment-17610209
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

shangxinli commented on code in PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#discussion_r981648516


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/NonBlockedDecompressor.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.codec;
+
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.parquet.Preconditions;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+abstract public class NonBlockedDecompressor implements Decompressor {
+
+  // Buffer for uncompressed output. This buffer grows as necessary.
+  private ByteBuffer outputBuffer = ByteBuffer.allocateDirect(0);
+
+  // Buffer for compressed input. This buffer grows as necessary.
+  private ByteBuffer inputBuffer = ByteBuffer.allocateDirect(0);
+
+  private boolean finished;
+
+  /**
+   * Fills specified buffer with uncompressed data. Returns actual number
+   * of bytes of uncompressed data. A return value of 0 indicates that
+   * {@link #needsInput()} should be called in order to determine if more
+   * input data is required.
+   *
+   * @param buffer Buffer for the compressed data
+   * @param off Start offset of the data
+   * @param len Size of the buffer
+   * @return The actual number of bytes of uncompressed data.
+   * @throws IOException if reading or decompression fails
+   */
+  @Override
+  public synchronized int decompress(byte[] buffer, int off, int len) throws 
IOException {
+SnappyUtil.validateBuffer(buffer, off, len);
+if (inputBuffer.position() == 0 && !outputBuffer.hasRemaining()) {
+  return 0;
+}
+
+if (!outputBuffer.hasRemaining()) {
+  inputBuffer.rewind();
+  Preconditions.checkArgument(inputBuffer.position() == 0, "Invalid 
position of 0.");
+  Preconditions.checkArgument(outputBuffer.position() == 0, "Invalid 
position of 0.");
+  // There is compressed input, decompress it now.
+  int decompressedSize = uncompressedLength(inputBuffer, len);
+  if (decompressedSize > outputBuffer.capacity()) {
+ByteBuffer oldBuffer = outputBuffer;
+outputBuffer = ByteBuffer.allocateDirect(decompressedSize);
+CleanUtil.cleanDirectBuffer(oldBuffer);
+  }
+
+  // Reset the previous outputBuffer (i.e. set position to 0)
+  outputBuffer.clear();
+  int size = uncompress(inputBuffer, outputBuffer);
+  outputBuffer.limit(size);
+  // We've decompressed the entire input, reset the input now
+  inputBuffer.clear();
+  inputBuffer.limit(0);
+  finished = true;
+}
+
+// Return compressed output up to 'len'
+int numBytes = Math.min(len, outputBuffer.remaining());
+outputBuffer.get(buffer, off, numBytes);
+return numBytes;
+  }
+
+  /**
+   * Sets input data for decompression.
+   * This should be called if and only if {@link #needsInput()} returns
+   * true indicating that more input data is required.
+   * (Both native and non-native versions of various Decompressors require
+   * that the data passed in via b[] remain unmodified until
+   * the caller is explicitly notified--via {@link #needsInput()}--that the
+   * buffer may be safely modified.  With this requirement, an extra
+   * buffer-copy can be avoided.)
+   *
+   * @param buffer Input data
+   * @param off Start offset
+   * @param len Length
+   */
+  @Override
+  public synchronized void setInput(byte[] buffer, int off, int len) {
+SnappyUtil.validateBuffer(buffer, off, len);

Review Comment:
   Should we refactor this SnappyUtil also? 





> Support LZ4_RAW codec
> -
>
> Key: PARQUET-2196
> URL: https://issues.apache.org/jira/browse/PARQUET-2196
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gang Wu
>Priority: Major
>
> There is a long history about the LZ4 interoperability of parquet files 
> between parquet-mr and parquet-cpp (which is now in Apache Arrow). 
> Attached links are the evidence. In short, a new LZ4_RAW codec type has been 
> introduced since parquet format v2.9.0. However, only parquet-cpp supports 
> LZ4_RAW. The parquet-mr library still uses the old Hadoop-provided LZ4 codec 
> and cannot read parquet files with LZ4_RAW.

[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1000: PARQUET-2196: Support LZ4_RAW codec

2022-09-27 Thread GitBox


shangxinli commented on code in PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#discussion_r981648516


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/NonBlockedDecompressor.java:
##
@@ -0,0 +1,174 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.codec;
+
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.parquet.Preconditions;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+abstract public class NonBlockedDecompressor implements Decompressor {
+
+  // Buffer for uncompressed output. This buffer grows as necessary.
+  private ByteBuffer outputBuffer = ByteBuffer.allocateDirect(0);
+
+  // Buffer for compressed input. This buffer grows as necessary.
+  private ByteBuffer inputBuffer = ByteBuffer.allocateDirect(0);
+
+  private boolean finished;
+
+  /**
+   * Fills specified buffer with uncompressed data. Returns actual number
+   * of bytes of uncompressed data. A return value of 0 indicates that
+   * {@link #needsInput()} should be called in order to determine if more
+   * input data is required.
+   *
+   * @param buffer Buffer for the compressed data
+   * @param off Start offset of the data
+   * @param len Size of the buffer
+   * @return The actual number of bytes of uncompressed data.
+   * @throws IOException if reading or decompression fails
+   */
+  @Override
+  public synchronized int decompress(byte[] buffer, int off, int len) throws IOException {
+    SnappyUtil.validateBuffer(buffer, off, len);
+    if (inputBuffer.position() == 0 && !outputBuffer.hasRemaining()) {
+      return 0;
+    }
+
+    if (!outputBuffer.hasRemaining()) {
+      inputBuffer.rewind();
+      Preconditions.checkArgument(inputBuffer.position() == 0, "Invalid position of 0.");
+      Preconditions.checkArgument(outputBuffer.position() == 0, "Invalid position of 0.");
+      // There is compressed input, decompress it now.
+      int decompressedSize = uncompressedLength(inputBuffer, len);
+      if (decompressedSize > outputBuffer.capacity()) {
+        ByteBuffer oldBuffer = outputBuffer;
+        outputBuffer = ByteBuffer.allocateDirect(decompressedSize);
+        CleanUtil.cleanDirectBuffer(oldBuffer);
+      }
+
+      // Reset the previous outputBuffer (i.e. set position to 0)
+      outputBuffer.clear();
+      int size = uncompress(inputBuffer, outputBuffer);
+      outputBuffer.limit(size);
+      // We've decompressed the entire input, reset the input now
+      inputBuffer.clear();
+      inputBuffer.limit(0);
+      finished = true;
+    }
+
+    // Return decompressed output up to 'len'
+    int numBytes = Math.min(len, outputBuffer.remaining());
+    outputBuffer.get(buffer, off, numBytes);
+    return numBytes;
+  }
+
+  /**
+   * Sets input data for decompression.
+   * This should be called if and only if {@link #needsInput()} returns
+   * true indicating that more input data is required.
+   * (Both native and non-native versions of various Decompressors require
+   * that the data passed in via b[] remain unmodified until
+   * the caller is explicitly notified--via {@link #needsInput()}--that the
+   * buffer may be safely modified.  With this requirement, an extra
+   * buffer-copy can be avoided.)
+   *
+   * @param buffer Input data
+   * @param off Start offset
+   * @param len Length
+   */
+  @Override
+  public synchronized void setInput(byte[] buffer, int off, int len) {
+    SnappyUtil.validateBuffer(buffer, off, len);

Review Comment:
   Should we refactor this SnappyUtil also? 
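The needsInput()/decompress() contract described in the javadoc above (a return value of 0 means "feed me more input") can be sketched as a tiny, self-contained state machine. This is an illustrative identity "codec" showing only the buffering protocol; it is not the Hadoop Decompressor interface or the parquet-mr implementation itself, and all names below are hypothetical:

```java
import java.util.Arrays;

// Minimal sketch of the setInput()/needsInput()/decompress() protocol that
// NonBlockedDecompressor implements. The "decompression" here is identity;
// only the buffering state machine is illustrated.
public class DecompressorSketch {
  private byte[] input = new byte[0];  // pending "compressed" input
  private int outPos = 0;              // how much output was already handed out

  // True when all buffered input has been drained and the caller
  // must provide more data via setInput().
  public boolean needsInput() {
    return outPos >= input.length;
  }

  public void setInput(byte[] buffer, int off, int len) {
    input = Arrays.copyOfRange(buffer, off, off + len);
    outPos = 0;
  }

  // Returns 0 when more input is needed, mirroring the javadoc above.
  public int decompress(byte[] buffer, int off, int len) {
    if (needsInput()) {
      return 0;
    }
    int n = Math.min(len, input.length - outPos);
    System.arraycopy(input, outPos, buffer, off, n);
    outPos += n;
    return n;
  }

  public static void main(String[] args) {
    DecompressorSketch d = new DecompressorSketch();
    byte[] out = new byte[4];
    System.out.println(d.decompress(out, 0, 4));  // prints 0: needs input first
    d.setInput("hello".getBytes(), 0, 5);
    int n1 = d.decompress(out, 0, 4);             // drains 4 bytes
    int n2 = d.decompress(out, 0, 4);             // drains the last byte
    System.out.println(n1 + " " + n2);            // prints "4 1"
  }
}
```

The real NonBlockedDecompressor follows the same shape, except it buffers into direct ByteBuffers and delegates to the codec-specific uncompressedLength()/uncompress() hooks.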



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Vectored IO in Parquet ( https://issues.apache.org/jira/browse/PARQUET-2171)

2022-09-27 Thread Mukund Madhav Thakur
Hi Team,
We in the Hadoop project recently added a new feature, Hadoop Vectored IO,
which will ship in the upcoming Hadoop 3.3.5 release.
It is a high-performance scatter/gather extension of the PositionedReadable
API, optimized for reading columnar data from cloud storage.
https://issues.apache.org/jira/browse/HADOOP-18103
We observed really good performance improvements in Hive TPC-H and TPC-DS
benchmarks for ORC data stored in S3.

We are now looking at Parquet integration as well:
https://issues.apache.org/jira/browse/PARQUET-2171
I have a draft patch which works locally through Spark's file reader:
https://github.com/apache/parquet-mr/pull/999

We know Parquet likes to support builds against older versions of
Hadoop, so we are working on a solution that offers the API through a
shim library.
As I have never contributed to the Parquet codebase and it is totally new
to me, I would really appreciate some help in implementing, testing, and
releasing this feature in the best possible way.

I will be talking about all of this at the upcoming ApacheCon NA next
week, Tuesday, October 04, 4:10 PM CDT. It would be really great to meet
anyone who is interested in getting involved.



Thanks,
Mukund
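The scatter/gather idea behind Vectored IO can be sketched in plain Java: the caller hands over a batch of (offset, length) ranges in one call and receives a future per range, which lets an implementation issue the reads in parallel or coalesce adjacent ranges. The `Range` class and `readVectored` name below are illustrative stand-ins over an in-memory byte array, not the actual Hadoop `PositionedReadable` API:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Conceptual sketch of a scatter/gather ("vectored") read: ask for several
// (offset, length) ranges at once, get one future per range. A real
// implementation against object storage could fetch ranges in parallel or
// merge nearby ranges into a single GET.
public class VectoredReadSketch {
  static class Range {
    final long offset;
    final int length;
    Range(long offset, int length) { this.offset = offset; this.length = length; }
  }

  private final byte[] file;  // stands in for a remote file / S3 object

  VectoredReadSketch(byte[] file) { this.file = file; }

  // Each range is served asynchronously; callers collect the futures.
  List<CompletableFuture<ByteBuffer>> readVectored(List<Range> ranges) {
    return ranges.stream()
        .map(r -> CompletableFuture.supplyAsync(() ->
            ByteBuffer.wrap(Arrays.copyOfRange(
                file, (int) r.offset, (int) r.offset + r.length))))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) throws Exception {
    VectoredReadSketch fs =
        new VectoredReadSketch("parquet-footer-and-pages".getBytes());
    List<CompletableFuture<ByteBuffer>> reads =
        fs.readVectored(Arrays.asList(new Range(0, 7), new Range(8, 6)));
    for (CompletableFuture<ByteBuffer> f : reads) {
      System.out.println(new String(f.get().array()));  // "parquet", then "footer"
    }
  }
}
```

This maps naturally onto Parquet, where a reader already knows the exact byte ranges of the column chunks it needs before issuing any data reads.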


Parquet community sync meeting notes - 9/27/2022

2022-09-27 Thread Xinli shang
9/27/2022

Attendees: Gidon Gershinsky, Xinli Shang, Tim Miller, Jiasheng Zhang

1. Parquet cell-level encryption
   - Will open PRs after delivering it internally
2. PARQUET-2069: Fix some Avro schema issues. In general, Avro schema is a
   problematic area and we need some risk control.
3. PARQUET-2126: thread-safe compressor/decompressor
   - Xinli to have another look along with other PRs
4. PARQUET-2196: LZ4 raw compressor
   - In review
5. PR-960: has some comments
6. PARQUET-986: merged
7. PARQUET-1711: break circular dependency; how to handle the exception case


-- 
Xinli Shang


[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610117#comment-17610117
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

shangxinli commented on code in PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#discussion_r981380734


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/Lz4RawDecompressor.java:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.codec;
+
+import io.airlift.compress.lz4.Lz4Decompressor;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+public class Lz4RawDecompressor extends NonBlockedDecompressor {
+
+  private Lz4Decompressor decompressor = new Lz4Decompressor();
+
+  @Override
+  protected int uncompress(ByteBuffer compressed, ByteBuffer uncompressed) throws IOException {
+    decompressor.decompress(compressed, uncompressed);
+    int uncompressedSize = uncompressed.position();
+    uncompressed.limit(uncompressedSize);
+    uncompressed.rewind();
+    return uncompressedSize;
+  }
+
+  @Override
+  protected int uncompressedLength(ByteBuffer compressed, int maxUncompressedLength) throws IOException {

Review Comment:
   Did you intentionally go with uncompressedMaxLength?





> Support LZ4_RAW codec
> -
>
> Key: PARQUET-2196
> URL: https://issues.apache.org/jira/browse/PARQUET-2196
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gang Wu
>Priority: Major
>
> There is a long history about the LZ4 interoperability of parquet files 
> between parquet-mr and parquet-cpp (which is now in the Apache Arrow). 
> Attached links are the evidence. In short, a new LZ4_RAW codec type has been 
> introduced since parquet format v2.9.0. However, only parquet-cpp supports 
> LZ4_RAW. The parquet-mr library still uses the old Hadoop-provided LZ4 codec 
> and cannot read parquet files with LZ4_RAW.
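The interoperability gap described above is essentially a framing difference: Hadoop's LZ4 codec wraps each compressed chunk in in-stream length headers, while LZ4_RAW stores the bare LZ4 block and relies on Parquet page metadata for sizes. Below is a hedged sketch of that difference, using an identity stand-in for the LZ4 block compressor so it stays self-contained; the header layout shown is a simplification of Hadoop's block stream format, not a spec-accurate rendering:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of why Hadoop-framed LZ4 and LZ4_RAW bytes are not interchangeable:
// the Hadoop codec prefixes length fields before the compressed bytes,
// LZ4_RAW writes only the bare block. An identity "compressor" stands in
// for real LZ4 so the example is self-contained.
public class Lz4FramingSketch {
  static byte[] lz4CompressBlock(byte[] data) {
    return data.clone();  // placeholder for a real LZ4 block compressor
  }

  // "Hadoop style" (simplified): length-prefixed frame around the block.
  static byte[] hadoopFramed(byte[] data) throws IOException {
    byte[] block = lz4CompressBlock(data);
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    out.writeInt(data.length);   // uncompressed length (4 bytes, big-endian)
    out.writeInt(block.length);  // compressed length (4 bytes, big-endian)
    out.write(block);
    return bos.toByteArray();
  }

  // "LZ4_RAW style": just the compressed block; lengths come from the
  // Parquet page header instead of an in-stream prefix.
  static byte[] raw(byte[] data) {
    return lz4CompressBlock(data);
  }

  public static void main(String[] args) throws IOException {
    byte[] page = "page-bytes".getBytes();
    System.out.println(hadoopFramed(page).length);  // prints 18: 8 header bytes + block
    System.out.println(raw(page).length);           // prints 10: block only
  }
}
```

A reader expecting one framing and handed the other will misparse the leading bytes, which is why the two codec ids must stay distinct in the file metadata.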



--
This message was sent by Atlassian Jira
(v8.20.10#820010)



[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610116#comment-17610116
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

shangxinli commented on code in PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#discussion_r981379233


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/Lz4RawDecompressor.java:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.codec;
+
+import io.airlift.compress.lz4.Lz4Decompressor;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+public class Lz4RawDecompressor extends NonBlockedDecompressor {
+
+  private Lz4Decompressor decompressor = new Lz4Decompressor();
+
+  @Override
+  protected int uncompress(ByteBuffer compressed, ByteBuffer uncompressed) throws IOException {

Review Comment:
   For parity between the two files (Lz4RawCompressor.java and Lz4RawDecompressor.java), can you follow the same order of the methods, xxxLength() and xxCompress()? This is a minor issue though.












[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610112#comment-17610112
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

shangxinli commented on code in PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#discussion_r981373160


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/Lz4RawCompressor.java:
##
@@ -0,0 +1,44 @@
+/* 
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.codec;
+
+import io.airlift.compress.lz4.Lz4Compressor;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+public class Lz4RawCompressor extends NonBlockedCompressor {
+
+  private Lz4Compressor compressor = new Lz4Compressor();
+
+  @Override
+  protected int maxCompressedLength(int byteSize) {
+    return io.airlift.compress.lz4.Lz4RawCompressor.maxCompressedLength(byteSize);

Review Comment:
   use import



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/Lz4RawCompressor.java:
##
@@ -0,0 +1,44 @@
+/* 
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.codec;
+
+import io.airlift.compress.lz4.Lz4Compressor;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+public class Lz4RawCompressor extends NonBlockedCompressor {
+
+  private Lz4Compressor compressor = new Lz4Compressor();
+
+  @Override
+  protected int maxCompressedLength(int byteSize) {
+    return io.airlift.compress.lz4.Lz4RawCompressor.maxCompressedLength(byteSize);

Review Comment:
   use import?












[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610111#comment-17610111
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

shangxinli commented on code in PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#discussion_r981370983


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/Lz4RawCodec.java:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.codec;
+
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.compress.*;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * Lz4 raw compression codec for Parquet. This codec type has been introduced
+ * into the parquet format since version 2.9.0.
+ */
+public class Lz4RawCodec implements Configurable, CompressionCodec {
+
+  private Configuration conf;
+  // Hadoop config for how big to make intermediate buffers.
+  private final String BUFFER_SIZE_CONFIG = "io.file.buffer.size";
+
+  @Override
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+  }
+
+  @Override
+  public Configuration getConf() {
+    return conf;
+  }
+
+  @Override
+  public Compressor createCompressor() {
+    return new Lz4RawCompressor();
+  }
+
+  @Override
+  public Decompressor createDecompressor() {
+    return new Lz4RawDecompressor();
+  }
+
+  @Override
+  public CompressionInputStream createInputStream(InputStream stream)
+      throws IOException {
+    return createInputStream(stream, createDecompressor());
+  }
+
+  @Override
+  public CompressionInputStream createInputStream(InputStream stream,
+                                                  Decompressor decompressor) throws IOException {
+    return new NonBlockedDecompressorStream(stream, decompressor,
+        conf.getInt(BUFFER_SIZE_CONFIG, 4 * 1024));

Review Comment:
   Can you make it a DEFAULT_BUFFER_SIZE_CONFIG?












[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610109#comment-17610109
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

shangxinli commented on code in PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#discussion_r981369466


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/Lz4RawCodec.java:
##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.codec;
+
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.compress.*;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+
+/**
+ * Lz4 raw compression codec for Parquet. This codec type has been introduced
+ * into the parquet format since version 2.9.0.
+ */
+public class Lz4RawCodec implements Configurable, CompressionCodec {
+
+  private Configuration conf;
+  // Hadoop config for how big to make intermediate buffers.
+  private final String BUFFER_SIZE_CONFIG = "io.file.buffer.size";

Review Comment:
   Can we make the unit of the size clear?












[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610104#comment-17610104
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

shangxinli commented on PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#issuecomment-1259621617

   Thanks Gang for contributing! Are there any benchmarking numbers? Any 
comparison with ZSTD? These are non-blocking questions for review and merging. 




> Support LZ4_RAW codec
> -
>
> Key: PARQUET-2196
> URL: https://issues.apache.org/jira/browse/PARQUET-2196
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gang Wu
>Priority: Major
>
> There is a long history of LZ4 interoperability problems with parquet files 
> between parquet-mr and parquet-cpp (which is now in Apache Arrow); the 
> attached links are the evidence. In short, a new LZ4_RAW codec type was 
> introduced in parquet format v2.9.0, but only parquet-cpp supports it. The 
> parquet-mr library still uses the old Hadoop-provided LZ4 codec and cannot 
> read parquet files written with LZ4_RAW.
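For context, the incompatibility comes down to framing: Hadoop's LZ4 codec prefixes each compressed block with two big-endian int32 lengths (uncompressed, then compressed), while LZ4_RAW stores the bare LZ4 block with no prefix, so each reader misparses the other's bytes. A minimal sketch (the `Lz4FramingSketch` class is illustrative, not from either library; the 7-byte payload happens to be a valid raw LZ4 block for "FooBar": token 0x60 means six literals, no match):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch of the framing difference behind the interop problem:
// LZ4_RAW stores just the raw LZ4 block, while the Hadoop codec prefixes
// it with <uncompressed length><compressed length> as big-endian int32s.
public class Lz4FramingSketch {
  // A valid raw LZ4 block for "FooBar": token 0x60 = 6 literals, no match.
  static final byte[] RAW_BLOCK = {0x60, 'F', 'o', 'o', 'B', 'a', 'r'};

  static byte[] hadoopFrame(byte[] rawBlock, int uncompressedLen) {
    ByteBuffer buf = ByteBuffer.allocate(8 + rawBlock.length); // big-endian by default
    buf.putInt(uncompressedLen);   // original (uncompressed) length
    buf.putInt(rawBlock.length);   // compressed length
    buf.put(rawBlock);
    return buf.array();
  }

  public static void main(String[] args) {
    byte[] framed = hadoopFrame(RAW_BLOCK, 6);
    System.out.println(framed.length); // 15: 8 bytes of framing + 7-byte block
    System.out.println(Arrays.equals(Arrays.copyOfRange(framed, 8, 15), RAW_BLOCK)); // true
  }
}
```

A reader expecting LZ4_RAW would try to decode the 8 length-prefix bytes as an LZ4 block, which is why the two codecs cannot read each other's files.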



--
This message was sent by Atlassian Jira
(v8.20.10#820010)



[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610028#comment-17610028
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

wgtmac commented on PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#issuecomment-1259431976

   > @wgtmac Did you try to read an actual file produced by Parquet C++?
   > 
   > Note you can find such files in https://github.com/apache/parquet-testing/
   
   Yes, I have tried that. I will add some parquet files for compatibility 
testing as well.









[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17609962#comment-17609962
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

pitrou commented on code in PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#discussion_r981055817


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLz4RawCodec.java:
##
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop;
+
+import org.apache.parquet.hadoop.codec.*;
+import org.junit.Test;
+
+import java.io.IOException;
+
+import static org.junit.Assert.assertArrayEquals;
+import static org.junit.Assert.assertEquals;
+
+public class TestLz4RawCodec {
+  @Test
+  public void TestLz4Raw() throws IOException {
+// Reuse the snappy objects between test cases

Review Comment:
   "snappy"?



##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLz4RawCodec.java:
##
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop;
+
+import org.apache.parquet.hadoop.codec.*;
+import org.junit.Test;
+
+import java.io.IOException;
+
+import static org.junit.Assert.assertArrayEquals;
+import static org.junit.Assert.assertEquals;
+
+public class TestLz4RawCodec {
+  @Test
+  public void TestLz4Raw() throws IOException {
+// Reuse the snappy objects between test cases
+Lz4RawCompressor compressor = new Lz4RawCompressor();
+Lz4RawDecompressor decompressor = new Lz4RawDecompressor();
+
+TestLz4Raw(compressor, decompressor, "");
+TestLz4Raw(compressor, decompressor, "FooBar");
+TestLz4Raw(compressor, decompressor, "FooBar1", "FooBar2");
+TestLz4Raw(compressor, decompressor, "FooBar");
+TestLz4Raw(compressor, decompressor, "a", "blahblahblah", "abcdef");
+TestLz4Raw(compressor, decompressor, "");
+TestLz4Raw(compressor, decompressor, "FooBar");
+  }
+
+  private void TestLz4Raw(Lz4RawCompressor compressor, Lz4RawDecompressor 
decompressor,
+  String... strings) throws IOException {
+compressor.reset();
+decompressor.reset();
+
+int uncompressedSize = 0;
+for (String s : strings) {
+  uncompressedSize += s.length();
+}
+byte[] uncompressedData = new byte[uncompressedSize];
+int len = 0;
+for (String s : strings) {
+  byte[] tmp = s.getBytes();
+  System.arraycopy(tmp, 0, uncompressedData, len, s.length());
+  len += s.length();
+}

Review Comment:
   Also what's the point of passing several strings if you're just 
concatenating them together?



##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLz4RawCodec.java:
##
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, 


[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17609945#comment-17609945
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

pitrou commented on PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#issuecomment-1259279363

   cc @lidavidm









[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17609929#comment-17609929
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

wgtmac commented on PR #1000:
URL: https://github.com/apache/parquet-mr/pull/1000#issuecomment-1259233215

   @pitrou @shangxinli Can you please take a look? Thanks in advance!









[jira] [Commented] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17609928#comment-17609928
 ] 

ASF GitHub Bot commented on PARQUET-2196:
-

wgtmac opened a new pull request, #1000:
URL: https://github.com/apache/parquet-mr/pull/1000

   This PR implements the LZ4_RAW codec introduced in parquet format v2.9.0. 
Since there is a lot of common logic between the LZ4_RAW and SNAPPY codecs, 
this patch moves it into NonBlockedCompressor and NonBlockedDecompressor and 
makes each specific codec extend them.
   
   Added a TestLz4RawCodec test to make sure the new codec itself is correct.
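The NonBlockedCompressor/NonBlockedDecompressor split described here is a template-method refactor: the base class owns buffer handling, and each codec supplies only the whole-buffer transform. A hypothetical sketch of the shape (class names echo the PR but the bodies are illustrative; DEFLATE in headerless mode stands in for LZ4, which would require a third-party library):

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Template-method sketch: the abstract base owns the byte shuffling; each
// codec supplies only the whole-buffer ("non-blocked") transform.
abstract class NonBlockedCompressorSketch {
  protected abstract int maxCompressedLength(int srcLen);
  protected abstract int compressBlock(byte[] src, byte[] dst);

  public final byte[] compress(byte[] src) {
    byte[] dst = new byte[maxCompressedLength(src.length)];
    return Arrays.copyOf(dst, compressBlock(src, dst));
  }
}

// DEFLATE with nowrap=true (no zlib header/trailer) standing in for raw LZ4.
class DeflateRawCompressor extends NonBlockedCompressorSketch {
  @Override protected int maxCompressedLength(int srcLen) {
    return srcLen + srcLen / 10 + 64; // loose upper bound for incompressible input
  }

  @Override protected int compressBlock(byte[] src, byte[] dst) {
    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
    deflater.setInput(src);
    deflater.finish();
    int n = deflater.deflate(dst);
    deflater.end();
    return n;
  }
}

class DeflateRawDecompressor {
  public byte[] decompress(byte[] src, int uncompressedLen) {
    Inflater inflater = new Inflater(true);
    // Per the Inflater javadoc, nowrap mode needs a trailing dummy input byte.
    inflater.setInput(Arrays.copyOf(src, src.length + 1));
    byte[] out = new byte[uncompressedLen];
    try {
      int n = inflater.inflate(out);
      if (n != uncompressedLen) throw new DataFormatException("short read: " + n);
    } catch (DataFormatException e) {
      throw new RuntimeException(e);
    } finally {
      inflater.end();
    }
    return out;
  }
}

public class CodecRoundTripSketch {
  public static void main(String[] args) {
    byte[] data = "FooBar1FooBar2FooBar1FooBar2".getBytes();
    byte[] compressed = new DeflateRawCompressor().compress(data);
    byte[] restored = new DeflateRawDecompressor().decompress(compressed, data.length);
    System.out.println(Arrays.equals(restored, data)); // true
  }
}
```

The payoff of this shape is that a new codec like LZ4_RAW only has to implement the two abstract hooks; the buffer sizing and copying logic is written once in the base class.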









[jira] [Updated] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated PARQUET-2196:
-
Description: There is a long history about the LZ4 interoperability of 
parquet files between parquet-mr and parquet-cpp (which is now in the Apache 
Arrow). Attached links are the evidence. In short, a new LZ4_RAW codec type has 
been introduced since parquet format v2.9.0. However, only parquet-cpp supports 
LZ4_RAW. The parquet-mr library still uses the old Hadoop-provided LZ4 codec 
and cannot read parquet files with LZ4_RAW.  (was: There is a long history 
about the LZ4 interoperability of parquet files between parquet-mr and 
parquet-cpp (which is now in the Apache Arrow). Since parquet format v2.9.0, a 
new LZ4_RAW codec type has been introduced but)



[jira] [Updated] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated PARQUET-2196:
-
Description: There is a long history about the LZ4 interoperability of 
parquet files between parquet-mr and parquet-cpp (which is now in the Apache 
Arrow). Since parquet format v2.9.0, a new LZ4_RAW codec type has been 
introduced but



[jira] [Created] (PARQUET-2196) Support LZ4_RAW codec

2022-09-27 Thread Gang Wu (Jira)
Gang Wu created PARQUET-2196:


 Summary: Support LZ4_RAW codec
 Key: PARQUET-2196
 URL: https://issues.apache.org/jira/browse/PARQUET-2196
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Gang Wu





