iemejia commented on code in PR #3476:
URL: https://github.com/apache/parquet-java/pull/3476#discussion_r3106683625


##########
parquet-plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPacking512VectorLE.java:
##########
@@ -3175,13 +3171,30 @@ public final void unpackValuesUsingVector(final byte[] in, final int inPos, fina
 
     public final void unpackValuesUsingVector(
         final ByteBuffer in, final int inPos, final int[] out, final int outPos) {
-      ByteVector byteVector = ByteVector.fromByteBuffer(B512, in, inPos, in.order(), inp_mask);
+      ByteVector byteVector = fromByteBuffer(B512, in, inPos, inp_mask);
       IntVector tempRes1 = byteVector.rearrange(perm_mask0).reinterpretAsInts();
 
       tempRes1.intoArray(out, outPos, out_mask);
     }
   }
 
+  private static ByteVector fromByteBuffer(VectorSpecies<Byte> species, ByteBuffer input, int inPos) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, species.length()), 0);
+  }
+
+  private static ByteVector fromByteBuffer(
+      VectorSpecies<Byte> species, ByteBuffer input, int inPos, VectorMask<Byte> mask) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, mask.trueCount()), 0, mask);
+  }
+
+  private static byte[] readInputBytes(ByteBuffer input, int inPos, int byteCount) {
+    byte[] bytes = new byte[byteCount];
+    ByteBuffer source = input.duplicate();
+    source.position(inPos);
+    source.get(bytes);

Review Comment:
   Agreed on the regression concern, and the heap-buffer fast path is in dffb6e6f6: when `input.hasArray()` is true, the helper returns `ByteVector.fromArray(species, input.array(), input.arrayOffset() + inPos[, mask])` directly, which HotSpot intrinsifies to the same `vmovdqu64` AVX-512 load as the original `fromByteBuffer`. Since the `BytePacker` call sites in parquet-mr use `ByteBuffer.wrap(byte[])` (heap buffers), this fast path covers the dominant production case with no copy and no `duplicate()`.
   
   I'd like to push back on the `ThreadLocal<byte[]>` part:
   
   1. It only eliminates the `byte[]` allocation, not the 
`ByteBuffer.duplicate()`, which is the larger of the two costs on the 
direct-buffer path. To eliminate both you'd need a thread-local `ByteBuffer` 
view as well, at which point the helper is doing more bookkeeping than the load 
itself.
   2. The `byte[]` is 8–64 bytes, TLAB-allocated, never escapes the helper, and 
is a strong escape-analysis candidate — HotSpot will frequently scalar-replace 
it on hot paths. `ThreadLocal.get()` (~5 ns) never scalarizes, so on the common 
JIT-hot case the `ThreadLocal` is a net loss.
   3. A `ThreadLocal` holding a mutable `byte[]` carries the usual operational risks in long-lived JVMs (thread-pool retention, classloader leak surface) for code that is explicitly throwaway. dffb6e6f6 adds a TODO marking `ByteVector.fromMemorySegment` (JEP 454, JDK 22+) as the proper replacement: it removes the copy entirely for both heap and direct buffers via `ScopedMemoryAccess` and is intrinsifiable. Once the project's minimum JDK reaches 22, both `fromByteBuffer` helpers and `readInputBytes` go away.
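   For concreteness, here is roughly what the `ThreadLocal<byte[]>` variant would have to look like — a hypothetical sketch, not code from this PR — showing that the `duplicate()` survives and that a shared mutable array now escapes to every caller on the same thread:

```java
import java.nio.ByteBuffer;

public class ThreadLocalVariantSketch {
    // Hypothetical scratch buffer; sized for the widest B512 load (64 bytes).
    private static final ThreadLocal<byte[]> SCRATCH =
        ThreadLocal.withInitial(() -> new byte[64]);

    static byte[] readInputBytes(ByteBuffer input, int inPos, int byteCount) {
        byte[] bytes = SCRATCH.get();          // map lookup; never scalar-replaced
        ByteBuffer source = input.duplicate(); // the larger cost is still paid
        source.position(inPos);
        source.get(bytes, 0, byteCount);       // only the first byteCount bytes are valid
        return bytes;                          // same array on every call per thread
    }

    public static void main(String[] args) {
        ByteBuffer in = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5, 6, 7, 8});
        byte[] a = readInputBytes(in, 0, 4);
        byte[] b = readInputBytes(in, 4, 4);
        // The hazard: the second call clobbers the first result in place.
        System.out.println(a == b); // true
        System.out.println(a[0]);   // 5, not 1
    }
}
```

   This is safe only as long as every caller consumes the array before the next call on the same thread, which is exactly the kind of invariant that is easy to break during refactoring.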
   
   So the change keeps the direct-buffer path as a simple `byte[] + 
duplicate()` fallback (matching the previous PR's behavior on that path), adds 
the heap fast path you suggested for the dominant case, and defers further 
optimization to the `fromMemorySegment` migration tracked in the TODO.
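   As a rough sketch of where the TODO points (assumes JDK 22+ with `java.lang.foreign` finalized; `asSegment` is my name, not an API): `MemorySegment.ofBuffer` views both heap and direct buffers in place, giving the zero-copy source that `ByteVector.fromMemorySegment` would load from.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;

public class MemorySegmentSketch {
    static MemorySegment asSegment(ByteBuffer input, int inPos, int byteCount) {
        // ofBuffer wraps the buffer's memory in place: no byte[] allocation
        // and no duplicate(), regardless of heap vs. direct backing.
        return MemorySegment.ofBuffer(input).asSlice(inPos, byteCount);
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5, 6, 7, 8});
        MemorySegment slice = asSegment(heap, 2, 4);
        System.out.println(slice.byteSize());                     // 4
        System.out.println(slice.get(ValueLayout.JAVA_BYTE, 0));  // 3
    }
}
```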



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

