iemejia commented on code in PR #3476:
URL: https://github.com/apache/parquet-java/pull/3476#discussion_r3106683614
##########
parquet-plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPacking512VectorLE.java:
##########
@@ -3175,13 +3171,30 @@ public final void unpackValuesUsingVector(final byte[] in, final int inPos, fina
   public final void unpackValuesUsingVector(
       final ByteBuffer in, final int inPos, final int[] out, final int outPos) {
-    ByteVector byteVector = ByteVector.fromByteBuffer(B512, in, inPos, in.order(), inp_mask);
+    ByteVector byteVector = fromByteBuffer(B512, in, inPos, inp_mask);
     IntVector tempRes1 = byteVector.rearrange(perm_mask0).reinterpretAsInts();
     tempRes1.intoArray(out, outPos, out_mask);
   }
 }
+  private static ByteVector fromByteBuffer(VectorSpecies<Byte> species, ByteBuffer input, int inPos) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, species.length()), 0);
+  }
+
+  private static ByteVector fromByteBuffer(
+      VectorSpecies<Byte> species, ByteBuffer input, int inPos, VectorMask<Byte> mask) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, mask.trueCount()), 0, mask);
+  }
+ private static ByteVector fromByteBuffer(VectorSpecies<Byte> species,
ByteBuffer input, int inPos) {
+ return ByteVector.fromArray(species, readInputBytes(input, inPos,
species.length()), 0);
+ }
+
+ private static ByteVector fromByteBuffer(
+ VectorSpecies<Byte> species, ByteBuffer input, int inPos,
VectorMask<Byte> mask) {
+ return ByteVector.fromArray(species, readInputBytes(input, inPos,
mask.trueCount()), 0, mask);
+ }
Review Comment:
Good catch, this was wrong. `mask.trueCount()` only happens to be safe at the call sites in this file because every mask is built as a contiguous-prefix mask. In general, `ByteVector.fromArray(species, array, 0, mask)` requires `array.length >= species.length()` for any mask shape, and the masked overload of OpenJDK 17's `ByteVector.fromByteBuffer` actually loads the full window unconditionally and then `blend`s with the mask; it is not a prefix copy. The fix in dffb6e6f6 is exactly that: always allocate `species.length()` bytes and let `fromArray(..., mask)` apply the mask, matching JDK semantics for arbitrary masks. Verified that the AVX-512 unpack tests still pass on JDK 17, 21, and 25.
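To make the invariant concrete, here is a small stand-alone sketch (not the generated code; `SPECIES_LENGTH` is a hypothetical stand-in for `species.length()` on a 512-bit byte species): the returned window is always a full species-length array, with only the active prefix copied from the buffer, so a masked `fromArray` can never read out of bounds.

```java
import java.nio.ByteBuffer;

public class MaskedLoadSketch {
    // Hypothetical stand-in for species.length() on a 512-bit byte species (64 lanes).
    static final int SPECIES_LENGTH = 64;

    // Mirrors the fixed readInputBytes contract: always allocate a full
    // species-length array (never mask.trueCount() bytes), so that a masked
    // fromArray is in bounds for ANY mask shape. Only 'activeLanes' bytes
    // are actually copied from the buffer; the rest stay zero.
    static byte[] readInputBytes(ByteBuffer input, int inPos, int activeLanes) {
        byte[] bytes = new byte[SPECIES_LENGTH];
        ByteBuffer source = input.duplicate(); // don't disturb the caller's position
        source.position(inPos);
        source.get(bytes, 0, activeLanes);     // copy only the active prefix
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer in = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5, 6, 7, 8});
        byte[] window = readInputBytes(in, 2, 4);
        System.out.println(window.length);  // 64: full species window
        System.out.println(window[0]);      // 3: first active byte, from inPos
        System.out.println(window[4]);      // 0: masked-off lanes stay zero
        System.out.println(in.position());  // 0: caller's buffer untouched
    }
}
```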
##########
parquet-plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPacking512VectorLE.java:
##########
@@ -3175,13 +3171,30 @@ public final void unpackValuesUsingVector(final byte[] in, final int inPos, fina
   public final void unpackValuesUsingVector(
       final ByteBuffer in, final int inPos, final int[] out, final int outPos) {
-    ByteVector byteVector = ByteVector.fromByteBuffer(B512, in, inPos, in.order(), inp_mask);
+    ByteVector byteVector = fromByteBuffer(B512, in, inPos, inp_mask);
     IntVector tempRes1 = byteVector.rearrange(perm_mask0).reinterpretAsInts();
     tempRes1.intoArray(out, outPos, out_mask);
   }
 }
+  private static ByteVector fromByteBuffer(VectorSpecies<Byte> species, ByteBuffer input, int inPos) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, species.length()), 0);
+  }
+
+  private static ByteVector fromByteBuffer(
+      VectorSpecies<Byte> species, ByteBuffer input, int inPos, VectorMask<Byte> mask) {
+    return ByteVector.fromArray(species, readInputBytes(input, inPos, mask.trueCount()), 0, mask);
+  }
+
+  private static byte[] readInputBytes(ByteBuffer input, int inPos, int byteCount) {
+    byte[] bytes = new byte[byteCount];
+    ByteBuffer source = input.duplicate();
+    source.position(inPos);
+    source.get(bytes);
Review Comment:
Agreed on the regression concern, and the heap-buffer fast path is in dffb6e6f6: when `input.hasArray()` is true, the helper returns `ByteVector.fromArray(species, input.array(), input.arrayOffset() + inPos[, mask])` directly, which HotSpot intrinsifies to the same `vmovdqu64` AVX-512 load as the original `fromByteBuffer`. Since the `BytePacker` call sites in parquet-mr use `ByteBuffer.wrap(byte[])` (heap buffers), this fast path covers the dominant production case with no copy and no `duplicate()`.
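A quick plain-`java.nio` illustration (not the generated code) of why the `hasArray()` branch covers those call sites, and why the fast path must add `arrayOffset()`:

```java
import java.nio.ByteBuffer;

public class HeapFastPathDemo {
    public static void main(String[] args) {
        // parquet-mr call sites wrap heap arrays, so hasArray() is true and the
        // backing array can be handed straight to fromArray with no copy.
        byte[] data = new byte[64];
        ByteBuffer heap = ByteBuffer.wrap(data);
        System.out.println(heap.hasArray());       // true
        System.out.println(heap.array() == data);  // true: same array, zero copies
        System.out.println(heap.arrayOffset());    // 0

        // A slice of a heap buffer still has a backing array, but with a nonzero
        // arrayOffset(); this is why the fast path indexes at arrayOffset() + inPos.
        ByteBuffer slice = heap.position(8).slice();
        System.out.println(slice.hasArray());      // true
        System.out.println(slice.arrayOffset());   // 8

        // Direct buffers expose no array, so they take the copy fallback.
        ByteBuffer direct = ByteBuffer.allocateDirect(64);
        System.out.println(direct.hasArray());     // false
    }
}
```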
I'd like to push back on the `ThreadLocal<byte[]>` part:
1. It only eliminates the `byte[]` allocation, not the
`ByteBuffer.duplicate()`, which is the larger of the two costs on the
direct-buffer path. To eliminate both you'd need a thread-local `ByteBuffer`
view as well, at which point the helper is doing more bookkeeping than the load
itself.
2. The `byte[]` is 8–64 bytes, TLAB-allocated, never escapes the helper, and is a strong escape-analysis candidate; HotSpot will frequently scalar-replace it on hot paths. `ThreadLocal.get()` (~5 ns) never scalarizes, so in the common JIT-hot case the `ThreadLocal` is a net loss.
3. A `ThreadLocal` holding a mutable `byte[]` carries the usual operational risks in long-lived JVMs (thread-pool retention, classloader leak surface) for code that is explicitly throwaway. Instead, dffb6e6f6 adds a TODO marking `ByteVector.fromMemorySegment` (JEP 454, JDK 22+) as the proper replacement: it removes the copy entirely for both heap and direct buffers via `ScopedMemoryAccess` and is intrinsifiable. Once the project's minimum JDK reaches 22, both `fromByteBuffer` helpers and `readInputBytes` go away.
So the change keeps the direct-buffer path as a simple `byte[] +
duplicate()` fallback (matching the previous PR's behavior on that path), adds
the heap fast path you suggested for the dominant case, and defers further
optimization to the `fromMemorySegment` migration tracked in the TODO. Happy to
add a JMH benchmark in `parquet-plugins-benchmarks` if you'd like empirical
numbers from AVX-512 hardware before merging.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]