[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #960: Performance optimization: Move all LittleEndianDataInputStream functionality into ByteBufferInputStream

GitBox Fri, 11 Nov 2022 07:35:44 -0800


wgtmac commented on code in PR #960:
URL: https://github.com/apache/parquet-mr/pull/960#discussion_r1020313641



##########
parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java:
##########
@@ -157,4 +165,80 @@ public void reset() throws IOException {
   public boolean markSupported() {
     return delegate.markSupported();
   }
+
+  public boolean readBoolean() throws IOException {
+    return readByte() != 0;
+  }
+
+  public byte readByte() throws IOException {
+    return delegate.readByte();
+  }
+
+  public int readUnsignedByte() throws IOException {
+    return delegate.readUnsignedByte();
+  }
+
+  public short readShort() throws IOException {

Review Comment:
   I like the idea to provide these read functions to enable larger read. BTW, 
is there any use case to read a batch of shorts (and other numeric types)?



##########
parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java:
##########
@@ -238,8 +257,31 @@ public int read(byte[] bytes, int off, int len) {
   }
 
   @Override
-  public int read(byte[] bytes) {
-    return read(bytes, 0, bytes.length);
+  public void readFully(byte[] bytes, int off, int len) throws IOException {
+    if (len <= 0) {
+      if (len < 0) {
+        throw new IndexOutOfBoundsException("Read length must be greater than 
0: " + len);
+      }
+      
+      return;
+    }
+
+    if (current == null || len > length) {
+      throw new EOFException();
+    }
+
+    int bytesRead = 0;
+    while (bytesRead < len) {

Review Comment:
   > Well, my objective here is to maximize performance. So we have to decide 
between maintainability and performance. Let's deliberate over this a bit more, 
and I'll do what you think is best.
   
   I deeply understand this is a difficult trade-off. Do you have any evidence 
on the performance penalty if we wrap `read` and `readFully` methods to share 
some common logic? If the penalty is acceptable, we should definitely go for 
maintainability. 



##########
parquet-common/src/main/java/org/apache/parquet/bytes/SingleBufferInputStream.java:
##########
@@ -88,6 +136,21 @@ public long skip(long n) {
     return bytesToSkip;
   }
 
+  @Override
+  public void skipFully(long n) throws IOException {
+    if (n < 0 || n > Integer.MAX_VALUE) {
+      throw new IllegalArgumentException();
+    }
+    
+    try {
+      buffer.position(buffer.position() + (int)n);
+    } catch (IllegalArgumentException e) {

Review Comment:
   > A try/catch is basically free when no exception is thrown, while a check 
is not. I have tested this, and the try/catch is empirically faster, since the 
no-exception case is the common case. Putting in the check means we have to 
wait until the C2 compiler produces a trace without the branch. But even then, 
there's always the overhead of a check somewhere to be able to fall back to 
interpretation if the condition is not correct for the trace. I'm avoiding all 
of that entirely, making this faster in of the most common case.
   
   That's interesting. A caveat is that try/catch may prohibit code 
optimization. But I doubt whether it makes significant difference on the simple 
buffer operation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #960: Performance optimization: Move all LittleEndianDataInputStream functionality into ByteBufferInputStream

Reply via email to