iemejia commented on code in PR #3476:
URL: https://github.com/apache/parquet-java/pull/3476#discussion_r3106683625
##########
parquet-plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPacking512VectorLE.java:
##########
@@ -3175,13 +3171,30 @@ public final void unpackValuesUsingVector(final byte[] in, final int inPos, fina
   public final void unpackValuesUsingVector(
       final ByteBuffer in, final int inPos, final int[] out, final int outPos) {
-    ByteVector byteVector = ByteVector.fromByteBuffer(B512, in, inPos, in.order(), inp_mask);
+    ByteVector byteVector = fromByteBuffer(B512, in, inPos, inp_mask);
     IntVector tempRes1 = byteVector.rearrange(perm_mask0).reinterpretAsInts();
     tempRes1.intoArray(out, outPos, out_mask);
   }
 }
+
+  private static ByteVector fromByteBuffer(VectorSpecies<Byte> species, ByteBuffer input, int inPos) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, species.length()), 0);
+  }
+
+  private static ByteVector fromByteBuffer(
+      VectorSpecies<Byte> species, ByteBuffer input, int inPos, VectorMask<Byte> mask) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, mask.trueCount()), 0, mask);
+  }
+
+  private static byte[] readInputBytes(ByteBuffer input, int inPos, int byteCount) {
+    byte[] bytes = new byte[byteCount];
+    ByteBuffer source = input.duplicate();
+    source.position(inPos);
+    source.get(bytes);
Review Comment:
Agreed on the regression concern, and the heap-buffer fast path is in dffb6e6f6: when `input.hasArray()` is true, the helper returns `ByteVector.fromArray(species, input.array(), input.arrayOffset() + inPos[, mask])` directly, which HotSpot intrinsifies to the same `vmovdqu64` AVX-512 load as the original `fromByteBuffer`. Since the `BytePacker` call sites in parquet-mr use `ByteBuffer.wrap(byte[])` (heap buffers), this fast path covers the dominant production case with no copy and no `duplicate()`.
I'd like to push back on the `ThreadLocal<byte[]>` part:
1. It only eliminates the `byte[]` allocation, not the
`ByteBuffer.duplicate()`, which is the larger of the two costs on the
direct-buffer path. To eliminate both you'd need a thread-local `ByteBuffer`
view as well, at which point the helper is doing more bookkeeping than the load
itself.
2. The `byte[]` is 8–64 bytes, TLAB-allocated, and never escapes the helper, which makes it a strong escape-analysis candidate: HotSpot will frequently scalar-replace it on hot paths. `ThreadLocal.get()` (roughly 5 ns) never scalarizes, so on the common JIT-hot case the `ThreadLocal` is a net loss.
3. `ThreadLocal` of a mutable `byte[]` carries the usual operational risks
in long-lived JVMs (thread-pool retention, classloader leak surface) for code
that is explicitly throwaway: dffb6e6f6 adds a TODO marking
`ByteVector.fromMemorySegment` (JEP 454, JDK 22+) as the proper replacement,
which removes the copy entirely for both heap and direct buffers via
`ScopedMemoryAccess` and is intrinsifiable. Once the project's minimum JDK
reaches 22, both helpers and `readInputBytes` go away.
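To make the trade-off concrete, here is a sketch of the `ThreadLocal<byte[]>` variant being argued against (hypothetical code, not from the PR; `ThreadLocalScratchSketch` and `MAX_LANES` are illustrative names). It shows what the pattern does and does not buy: the per-call `byte[]` allocation goes away, but the `duplicate()` on the direct-buffer path remains, and the scratch array now lives as long as its thread, aliased across calls.

```java
import java.nio.ByteBuffer;

public class ThreadLocalScratchSketch {
  private static final int MAX_LANES = 64; // widest species: 512 bits / 8

  // Per-thread scratch buffer; retained for the thread's lifetime,
  // which is the pool-retention concern raised in point 3.
  private static final ThreadLocal<byte[]> SCRATCH =
      ThreadLocal.withInitial(() -> new byte[MAX_LANES]);

  static byte[] readInputBytes(ByteBuffer input, int inPos, int byteCount) {
    byte[] bytes = SCRATCH.get();          // never scalar-replaced by the JIT
    ByteBuffer source = input.duplicate(); // still allocated on every call
    source.position(inPos);
    source.get(bytes, 0, byteCount);
    return bytes; // caller must read only the first byteCount bytes
  }

  public static void main(String[] args) {
    ByteBuffer in = ByteBuffer.wrap(new byte[] {10, 20, 30, 40});
    byte[] a = readInputBytes(in, 1, 2);
    byte[] b = readInputBytes(in, 0, 1);
    System.out.println(a == b); // true: the same per-thread array is reused
    System.out.println(a[0]);   // 10: the second call overwrote the scratch
  }
}
```

The aliasing in `main` illustrates why the reused array is also a correctness hazard: any caller that holds onto the returned array sees it silently mutated by the next call on the same thread.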
So the change keeps the direct-buffer path as a simple `byte[] +
duplicate()` fallback (matching the previous PR's behavior on that path), adds
the heap fast path you suggested for the dominant case, and defers further
optimization to the `fromMemorySegment` migration tracked in the TODO.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]