viirya opened a new pull request, #56082:
URL: https://github.com/apache/spark/pull/56082
### What changes were proposed in this pull request?
Six bulk-fill methods on the column vectors implement constant-value
fills with degenerate per-element loops. This PR replaces them with
intrinsic substitutions:
| Method | Substitution |
| --- | --- |
| `OnHeapColumnVector.putBooleans(rowId, count, value)` | `Arrays.fill(byte[], ..., (byte) v)` |
| `OnHeapColumnVector.putBytes(rowId, count, value)` | `Arrays.fill(byte[], ...)` |
| `OnHeapColumnVector.putShorts(rowId, count, value)` | `Arrays.fill(short[], ...)` |
| `OnHeapColumnVector.putLongs(rowId, count, value)` | `Arrays.fill(long[], ...)` |
| `OffHeapColumnVector.putBooleans(rowId, count, value)` | `Platform.setMemory` with small-count fallback |
| `OffHeapColumnVector.putBytes(rowId, count, value)` | `Platform.setMemory` with small-count fallback |
The two OffHeap methods share a `SET_MEMORY_THRESHOLD = 128` constant.
Below the threshold, an inline byte loop avoids the JNI fixed cost of
`Unsafe.setMemory`; at or above it, `setMemory` dominates and the gain
grows with `count`, from ~3x at `count = 512` to ~7x at `count = 4096`
and ~10x at `count = 65536` (see the benchmark tables below).
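The threshold dispatch described above can be sketched as follows. This
is a minimal illustration, not the code in this PR: the class name
`ThresholdFillSketch` and the helper `usesBulkPath` are hypothetical, a
plain `byte[]` stands in for off-heap memory, and `Arrays.fill` stands
in for `Platform.setMemory`. Only the `SET_MEMORY_THRESHOLD = 128`
value and the "below: loop, at or above: bulk call" rule come from the
description.

```java
import java.util.Arrays;

public class ThresholdFillSketch {
    // Threshold value taken from the PR description.
    static final int SET_MEMORY_THRESHOLD = 128;

    // Hypothetical helper: true when the bulk path would be taken.
    static boolean usesBulkPath(int count) {
        return count >= SET_MEMORY_THRESHOLD;
    }

    // A byte[] stands in for the off-heap region; offset plays the role of rowId.
    static void putBytes(byte[] mem, int offset, int count, byte value) {
        if (!usesBulkPath(count)) {
            // Small counts: inline per-byte loop, avoiding the fixed call cost.
            for (int i = 0; i < count; i++) {
                mem[offset + i] = value;
            }
        } else {
            // Large counts: one bulk call (stand-in for Platform.setMemory).
            Arrays.fill(mem, offset, offset + count, value);
        }
    }

    public static void main(String[] args) {
        byte[] mem = new byte[512];
        putBytes(mem, 0, 8, (byte) 1);     // takes the loop path
        putBytes(mem, 100, 256, (byte) 2); // takes the bulk path
        System.out.println(mem[7] + " " + mem[8] + " " + mem[100] + " " + mem[355]);
    }
}
```

Both paths write the same bytes; the threshold only selects which one
is cheaper for a given `count`.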
### Why are the changes needed?
The bulk-fill APIs on `WritableColumnVector` are the natural call to
make from any column writer, but their implementations were per-element
loops. Switching to intrinsics helps in two ways:
- `Arrays.fill` is backed by HotSpot's `_jbyte_fill` / `_jshort_fill` /
`_jlong_fill` intrinsic stubs; on byte/short arrays C2 can usually
auto-vectorize the original loop and gains are modest, but for
`long[]` and at small counts the intrinsic is meaningfully faster.
- `Unsafe.setMemory` lowers to a native memset. For OffHeap byte fills
the original per-byte `Platform.putByte` loop cannot be vectorized
through the JNI call, so the gain is dramatic at large counts.
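For the OnHeap case, the before/after shape of the change can be
sketched like this. The class and method names are hypothetical, and
the boolean encoding (one byte per value, 1 for true and 0 for false)
is an assumption based on the `(byte) v` cast in the table above; the
only point is that a ranged `Arrays.fill` writes exactly what the
per-element loop writes.

```java
import java.util.Arrays;

public class OnHeapFillSketch {
    // Baseline shape: one store per element.
    static void fillLoop(byte[] data, int rowId, int count, byte value) {
        for (int i = 0; i < count; i++) {
            data[rowId + i] = value;
        }
    }

    // Patched shape: a single ranged fill over [rowId, rowId + count).
    static void fillIntrinsic(byte[] data, int rowId, int count, byte value) {
        Arrays.fill(data, rowId, rowId + count, value);
    }

    // Assumed boolean encoding: 1 for true, 0 for false, one byte each.
    static void fillBooleans(byte[] data, int rowId, int count, boolean value) {
        Arrays.fill(data, rowId, rowId + count, (byte) (value ? 1 : 0));
    }

    public static void main(String[] args) {
        byte[] a = new byte[16];
        byte[] b = new byte[16];
        fillLoop(a, 3, 8, (byte) 7);
        fillIntrinsic(b, 3, 8, (byte) 7);
        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```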
Measured on an Apple M4 Max with OpenJDK 21.0.8, using a new
`WritableColumnVectorBulkFillBenchmark` (added in a separate change,
not part of this PR). Rates are in millions of elements per second:
**OffHeap byte fills (putBytes / putBooleans)**, threshold path:
| count | baseline | patched | delta |
| ------: | -------: | ------: | ----- |
| 8 | ~1,900 | ~1,840 | parity (small-count fallback) |
| 64 | ~3,800 | ~3,760 | parity |
| 512 | ~4,150 | ~13,100 | +3.2x |
| 4,096 | ~4,340 | ~31,900 | +7.4x |
| 65,536 | ~4,275 | ~43,700 | +10.2x |
**OnHeap byte fills**:
| count | baseline | patched | delta |
| ------: | -------: | ------: | ----- |
| 8 | ~2,620 | ~3,230 | +23% |
| 64 | ~19,000 | ~25,400 | +33% |
| 512 | ~68,800 | ~86,200 | +25% |
| 4,096 | ~128,400 | ~133,300 | +4% |
| 65,536 | ~143,200 | ~143,600 | saturated (byte memory bandwidth) |
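As a reading aid for the delta columns, the figures appear to be the
ratio of patched to baseline rate (shown as a factor for OffHeap and as
a percentage gain for OnHeap). The sketch below recomputes two entries
from the rounded rates in the tables; the class and method names are
hypothetical, and because the published rates are approximate, other
rows may differ by a point.

```java
import java.util.Locale;

public class DeltaSketch {
    // Speedup factor: patched rate divided by baseline rate.
    static double speedup(double baseline, double patched) {
        return patched / baseline;
    }

    // Percentage gain over baseline, rounded to the nearest integer.
    static long percentGain(double baseline, double patched) {
        return Math.round((patched / baseline - 1.0) * 100.0);
    }

    public static void main(String[] args) {
        // OffHeap bytes, count = 65,536: ~4,275 -> ~43,700
        System.out.println(String.format(Locale.ROOT, "%.1fx", speedup(4275, 43700))); // prints "10.2x"
        // OnHeap bytes, count = 8: ~2,620 -> ~3,230
        System.out.println(percentGain(2620, 3230)); // prints "23"
    }
}
```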
**OnHeap longs**: +1-14% in the small/medium range, saturated by
memory bandwidth at large counts. Included for consistency with the
byte methods.
OffHeap multi-byte fills (putShorts / putInts / putLongs / putFloats /
putDoubles) are out of scope: `Platform.setMemory` is byte-only, and a
`value == 0` short-circuit alternative was tried but showed no
measurable gain.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests; no behavior change. Ran locally:
- `VectorizedRleValuesReaderSuite`
- `ColumnVectorSuite`
- `ColumnarBatchSuite`
- `ParquetIOSuite`
237 tests, all pass.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]