arouel opened a new issue, #3464: URL: https://github.com/apache/parquet-java/issues/3464
### Describe the enhancement requested

`DeltaByteArrayWriter.writeBytes()` has two performance issues in its hot path, executed for every `Binary` value written with `DELTA_BYTE_ARRAY` encoding (the default for BINARY columns in V2 data pages):

### Problem 1: `getBytes()` allocates a new `byte[]` for every value

[Line 92](https://github.com/apache/parquet-java/blob/4c8f4d4b/parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.java#L92) calls `v.getBytes()`:

```java
byte[] vb = v.getBytes();
```

`Binary.getBytes()` **always** allocates a new `byte[]` and copies into it, even for `ByteArrayBackedBinary` (which does `Arrays.copyOfRange`) and `ByteBufferBackedBinary` (which allocates `new byte[length]` and copies from the ByteBuffer). The method `getBytesUnsafe()` exists and returns the backing array directly when the Binary owns its bytes (`isBackingBytesReused == false`), avoiding the copy entirely.

The returned `byte[]` (`vb`) is used for two purposes:

1. **Prefix comparison** against `previous` (line 95): read-only, so `getBytesUnsafe()` is sufficient
2. **Assigned to `previous`** for the next iteration (line 99): must be an owned copy that won't be mutated by the caller

When `isBackingBytesReused` is `false` (common case for values read from Parquet pages via `ColumnReader`), `getBytesUnsafe()` returns a stable array that is safe to retain. When `isBackingBytesReused` is `true`, a defensive copy is needed only for `previous`.

### Problem 2: byte-by-byte prefix comparison loop

[Line 95](https://github.com/apache/parquet-java/blob/4c8f4d4b/parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.java#L95) finds the common prefix between the current and previous value:

```java
for (i = 0; (i < length) && (previous[i] == vb[i]); i++) ;
```

This byte-by-byte loop cannot be vectorized by the JIT because of the per-element early-exit condition.
`Arrays.mismatch(byte[], int, int, byte[], int, int)` (available since Java 9) performs the same operation but is [intrinsified by the JVM](https://bugs.openjdk.org/browse/JDK-8033148) to use SIMD vector comparison.
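
The `Arrays.mismatch` replacement proposed under Problem 2 could look roughly like the sketch below. It is not the actual patch: the class and method names (`PrefixLengthSketch`, `prefixLengthLoop`, `prefixLengthMismatch`) are hypothetical, and it assumes `length` is the shorter of the two array lengths. One detail worth noting is that `Arrays.mismatch` returns `-1` when the compared ranges are identical, so that case has to be mapped back to `length`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of parquet-java.
final class PrefixLengthSketch {

  // Byte-by-byte scan with an early exit, mirroring the loop quoted above.
  static int prefixLengthLoop(byte[] previous, byte[] current) {
    // Assumption: only the shorter of the two lengths is compared.
    int length = Math.min(previous.length, current.length);
    int i = 0;
    while (i < length && previous[i] == current[i]) {
      i++;
    }
    return i;
  }

  // Candidate replacement using Arrays.mismatch (Java 9+).
  static int prefixLengthMismatch(byte[] previous, byte[] current) {
    int length = Math.min(previous.length, current.length);
    // Both ranges span exactly `length` bytes, so a result of -1 means
    // every compared byte matched and the common prefix is `length`.
    int i = Arrays.mismatch(previous, 0, length, current, 0, length);
    return i < 0 ? length : i;
  }

  public static void main(String[] args) {
    byte[] previous = "parquet-column".getBytes(StandardCharsets.UTF_8);
    byte[] current = "parquet-common".getBytes(StandardCharsets.UTF_8);
    System.out.println(prefixLengthLoop(previous, current));
    System.out.println(prefixLengthMismatch(previous, current));
  }
}
```

Both methods should agree on every input, which makes the loop version a convenient oracle for a unit test or a JMH baseline when evaluating the change.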
