iemejia commented on code in PR #3476:
URL: https://github.com/apache/parquet-java/pull/3476#discussion_r3106683614


##########
parquet-plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPacking512VectorLE.java:
##########
@@ -3175,13 +3171,30 @@ public final void unpackValuesUsingVector(final byte[] 
in, final int inPos, fina
 
     public final void unpackValuesUsingVector(
         final ByteBuffer in, final int inPos, final int[] out, final int 
outPos) {
-      ByteVector byteVector = ByteVector.fromByteBuffer(B512, in, inPos, 
in.order(), inp_mask);
+      ByteVector byteVector = fromByteBuffer(B512, in, inPos, inp_mask);
       IntVector tempRes1 = 
byteVector.rearrange(perm_mask0).reinterpretAsInts();
 
       tempRes1.intoArray(out, outPos, out_mask);
     }
   }
 
+  private static ByteVector fromByteBuffer(VectorSpecies<Byte> species, 
ByteBuffer input, int inPos) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, 
species.length()), 0);
+  }
+
+  private static ByteVector fromByteBuffer(
+      VectorSpecies<Byte> species, ByteBuffer input, int inPos, 
VectorMask<Byte> mask) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, 
mask.trueCount()), 0, mask);
+  }

Review Comment:
   Good catch — this was wrong. `mask.trueCount()` only happens to be safe at 
the call sites in this file because every mask is built as a contiguous-prefix 
mask, but `ByteVector.fromArray(species, array, 0, mask)` requires 
`array.length >= species.length()` for any mask shape, and the masked overload 
of OpenJDK 17's `ByteVector.fromByteBuffer` actually loads the full window 
unconditionally and then `blend`s with the mask — not a prefix copy. The fix in 
dffb6e6f6 is exactly that: always allocate `species.length()` bytes and let 
`fromArray(..., mask)` apply the mask, matching JDK semantics for arbitrary 
masks. Verified the AVX-512 unpack tests still pass on JDK 17 / 21 / 25.



##########
parquet-plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPacking512VectorLE.java:
##########
@@ -3175,13 +3171,30 @@ public final void unpackValuesUsingVector(final byte[] 
in, final int inPos, fina
 
     public final void unpackValuesUsingVector(
         final ByteBuffer in, final int inPos, final int[] out, final int 
outPos) {
-      ByteVector byteVector = ByteVector.fromByteBuffer(B512, in, inPos, 
in.order(), inp_mask);
+      ByteVector byteVector = fromByteBuffer(B512, in, inPos, inp_mask);
       IntVector tempRes1 = 
byteVector.rearrange(perm_mask0).reinterpretAsInts();
 
       tempRes1.intoArray(out, outPos, out_mask);
     }
   }
 
+  private static ByteVector fromByteBuffer(VectorSpecies<Byte> species, 
ByteBuffer input, int inPos) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, 
species.length()), 0);
+  }
+
+  private static ByteVector fromByteBuffer(
+      VectorSpecies<Byte> species, ByteBuffer input, int inPos, 
VectorMask<Byte> mask) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, 
mask.trueCount()), 0, mask);
+  }
+
+  private static byte[] readInputBytes(ByteBuffer input, int inPos, int 
byteCount) {
+    byte[] bytes = new byte[byteCount];
+    ByteBuffer source = input.duplicate();
+    source.position(inPos);
+    source.get(bytes);

Review Comment:
   Agreed on the regression concern, and the heap-buffer fast path is in 
dffb6e6f6 — `input.hasArray()` returns `ByteVector.fromArray(species, 
input.array(), input.arrayOffset() + inPos[, mask])` directly, which HotSpot 
intrinsifies to the same `vmovdqu64` AVX-512 load as the original 
`fromByteBuffer`. Since `BytePacker` call sites in parquet-mr use 
`ByteBuffer.wrap(byte[])` (heap), this fast path covers the dominant production 
case with no copy and no `duplicate()`.
   
   I'd like to push back on the `ThreadLocal<byte[]>` part:
   
   1. It only eliminates the `byte[]` allocation, not the 
`ByteBuffer.duplicate()`, which is the larger of the two costs on the 
direct-buffer path. To eliminate both you'd need a thread-local `ByteBuffer` 
view as well, at which point the helper is doing more bookkeeping than the load 
itself.
   2. The `byte[]` is 8–64 bytes, TLAB-allocated, never escapes the helper, and 
is a strong escape-analysis candidate — HotSpot will frequently scalar-replace 
it on hot paths. `ThreadLocal.get()` (~5 ns) never scalarizes, so on the common 
JIT-hot case the `ThreadLocal` is a net loss.
   3. `ThreadLocal` of a mutable `byte[]` carries the usual operational risks 
in long-lived JVMs (thread-pool retention, classloader leak surface) for code 
that is explicitly throwaway: dffb6e6f6 adds a TODO marking 
`ByteVector.fromMemorySegment` (JEP 454, JDK 22+) as the proper replacement, 
which removes the copy entirely for both heap and direct buffers via 
`ScopedMemoryAccess` and is intrinsifiable. Once the project's minimum JDK 
reaches 22, both helpers and `readInputBytes` go away.
   
   So the change keeps the direct-buffer path as a simple `byte[] + 
duplicate()` fallback (matching the previous PR's behavior on that path), adds 
the heap fast path you suggested for the dominant case, and defers further 
optimization to the `fromMemorySegment` migration tracked in the TODO. Happy to 
add a JMH benchmark in `parquet-plugins-benchmarks` if you'd like empirical 
numbers from AVX-512 hardware before merging.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to