arouel opened a new issue, #3464:
URL: https://github.com/apache/parquet-java/issues/3464

   ### Describe the enhancement requested
   
   `DeltaByteArrayWriter.writeBytes()` has two performance issues in its hot 
path, executed for every Binary value written with `DELTA_BYTE_ARRAY` encoding 
(the default for BINARY columns in V2 data pages):
   
   ### Problem 1: `getBytes()` allocates a new `byte[]` for every value
   
   [Line 92](https://github.com/apache/parquet-java/blob/4c8f4d4b/parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.java#L92) calls `v.getBytes()`:
   
   ```java
   byte[] vb = v.getBytes();
   ```
   
   `Binary.getBytes()` **always** allocates a new `byte[]` and copies into it — 
even for `ByteArrayBackedBinary` (which does `Arrays.copyOfRange`) and 
`ByteBufferBackedBinary` (which allocates `new byte[length]` and copies from 
the ByteBuffer). The method `getBytesUnsafe()` exists and returns the backing 
array directly when the Binary owns its bytes (`isBackingBytesReused == 
false`), avoiding the copy entirely.
   
   The returned `byte[]` (`vb`) is used for two purposes:
   
   1. **Prefix comparison** against `previous` (line 95) — read-only, 
`getBytesUnsafe()` is sufficient
   2. **Assigned to `previous`** for the next iteration (line 99) — must be an 
owned copy that won't be mutated by the caller
   
   When `isBackingBytesReused` is `false` (common case for values read from 
Parquet pages via `ColumnReader`), `getBytesUnsafe()` returns a stable array 
that is safe to retain. When `isBackingBytesReused` is `true`, a defensive copy 
is needed only for `previous`.
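
The copy-only-when-reused idea above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `StandInBinary` and `PreviousValueSketch` are hypothetical names, and the stand-in models only the two `Binary` methods mentioned above (it ignores offsets and slice-backed variants, which the real class may handle differently).

```java
import java.util.Arrays;

/** Hypothetical stand-in exposing only the two Binary methods discussed above. */
final class StandInBinary {
    private final byte[] backing;
    private final boolean backingBytesReused;

    StandInBinary(byte[] backing, boolean backingBytesReused) {
        this.backing = backing;
        this.backingBytesReused = backingBytesReused;
    }

    /** Returns the backing array without copying (assumed behavior for this sketch). */
    byte[] getBytesUnsafe() {
        return backing;
    }

    boolean isBackingBytesReused() {
        return backingBytesReused;
    }
}

/** Hypothetical sketch of the writer's handling of the "previous" value. */
final class PreviousValueSketch {
    private byte[] previous = new byte[0];

    /** Returns the common prefix length with the previous value, then retains v. */
    int accept(StandInBinary v) {
        // Read-only use: no copy is needed for the prefix comparison.
        byte[] vb = v.getBytesUnsafe();
        int length = Math.min(previous.length, vb.length);
        int i;
        for (i = 0; (i < length) && (previous[i] == vb[i]); i++)
            ;
        // Retained use: copy only if the caller may overwrite the backing array.
        previous = v.isBackingBytesReused() ? Arrays.copyOf(vb, vb.length) : vb;
        return i;
    }

    byte[] previous() {
        return previous;
    }
}
```

With this shape, a value that owns its bytes costs no allocation at all, while a value with a reused backing array still costs one copy, the same as today.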
   
   ### Problem 2: byte-by-byte prefix comparison loop
   
   [Line 95](https://github.com/apache/parquet-java/blob/4c8f4d4b/parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.java#L95) finds the common prefix between the current and previous value:
   
   ```java
   for (i = 0; (i < length) && (previous[i] == vb[i]); i++)
       ;
   ```
   
   This byte-by-byte loop cannot be vectorized by the JIT because of the 
per-element early-exit condition. `Arrays.mismatch(byte[], int, int, byte[], 
int, int)` (available since Java 9) performs the same operation but is 
[intrinsified by the JVM](https://bugs.openjdk.org/browse/JDK-8033148) to use 
SIMD vector comparison.
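
For illustration, here is a small self-contained comparison of the two approaches. The class and method names (`PrefixLengthSketch`, `commonPrefixLoop`, `commonPrefixMismatch`) are made up for this sketch. One detail worth noting: `Arrays.mismatch` returns `-1` when the two ranges are identical, so that case has to be mapped back to `length` to match what the loop produces.

```java
import java.util.Arrays;

/** Hypothetical helper comparing the existing loop with Arrays.mismatch. */
final class PrefixLengthSketch {

    /** Same shape as the loop quoted above. */
    static int commonPrefixLoop(byte[] previous, byte[] vb) {
        int length = Math.min(previous.length, vb.length);
        int i;
        for (i = 0; (i < length) && (previous[i] == vb[i]); i++)
            ;
        return i;
    }

    /** Equivalent result using the range overload of Arrays.mismatch (Java 9+). */
    static int commonPrefixMismatch(byte[] previous, byte[] vb) {
        int length = Math.min(previous.length, vb.length);
        int i = Arrays.mismatch(previous, 0, length, vb, 0, length);
        // -1 means the first `length` bytes are all equal.
        return i < 0 ? length : i;
    }
}
```

Both methods should agree on every input, including empty arrays and the case where one value is a full prefix of the other.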
   
   
   ### Component(s)
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

