iemejia opened a new pull request, #3500:
URL: https://github.com/apache/parquet-java/pull/3500
## Summary
Closes #3499.
Caches `Binary.hashCode()` per instance for non-reused (immutable-backing)
`Binary` values. Eliminates repeated full-buffer hash recomputation during
`PLAIN_DICTIONARY` encoding, where the same key is hashed many times across a
100k-value page. Reused (mutable-backing) instances skip the cache to preserve
their existing semantics.
Uses the `java.lang.String.hashCode()` idiom — a single `int` field with
sentinel `0` meaning "not yet computed" — so the cache is **race-safe without
`volatile`** (concurrent first calls compute the same deterministic value;
either ordering is correct).
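The idiom can be sketched in a minimal standalone class (names here are illustrative, not the actual Parquet `Binary` internals):

```java
// Minimal sketch of the String.hashCode()-style cache over an immutable byte backing.
final class CachedHashBytes {
    private final byte[] bytes;           // immutable backing — safe to cache the hash
    private transient int cachedHashCode; // 0 == "not yet computed" (sentinel)

    CachedHashBytes(byte[] bytes) {
        this.bytes = bytes.clone();       // defensive copy keeps the backing immutable
    }

    @Override
    public int hashCode() {
        int h = cachedHashCode;
        if (h == 0) {                     // cache miss (or a value whose hash really is 0)
            for (byte b : bytes) {
                h = 31 * h + b;
            }
            cachedHashCode = h;           // benign race: every thread computes the same value
        }
        return h;
    }
}
```

As with `String`, a value whose true hash happens to be 0 is simply recomputed on every call; that rare slow path is the price of avoiding a separate "computed" flag or `volatile` field.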
## Benchmark
`BinaryEncodingBenchmark.encodeDictionary`, 100k BINARY values per
invocation, JDK 18, JMH `-wi 5 -i 10 -f 3` (30 samples per row):
| cardinality | stringLength | Before (ops/s) | After (ops/s) | Improvement |
|-------------|-------------:|---------------:|--------------:|------------:|
| LOW | 10 | 13,170,110 | 20,203,480 | **+53%** |
| LOW | 100 | 2,955,460 | 18,048,610 | **+511%** |
| LOW | 1000 | 300,693 | 21,933,470 | **+7193% (72.9x)** |
| HIGH | 10 | 847,657 | 1,336,238 | **+58%** |
| HIGH | 100 | 418,327 | 1,323,284 | **+216%** |
| HIGH | 1000 | 72,553 | 1,296,679 | **+1687% (17.9x)** |
The relative gain grows with string length (per-call hash work is O(N),
cache lookup is O(1)) and with low cardinality (each unique key is hashed many
more times).
**Negative control**: `encodePlain` (writes Binary without dictionary
lookups, so it doesn't exercise `hashCode`) is unchanged within ±2.5% across all
parameter combinations. **Allocation rate per op** (`gc.alloc.rate.norm`) is
identical between baseline and optimized, so the speedup is pure CPU savings
rather than reduced allocation.
## Implementation notes
- Single field added: `transient int cachedHashCode` on `Binary`
(package-private so the three nested subclasses can read it directly on the hot
path; inherited private fields are not accessible from nested subclasses
without a method-call indirection).
- The cached value is gated on `!isBackingBytesReused` inside a small
package-private `cacheHashCode(int)` helper that runs only on the cache-miss
path.
- 2 new tests in `TestBinary`:
- `testHashCodeCachedForConstantBinary`: constant Binary returns stable
`hashCode`, equal across the three impls (`ByteArraySliceBackedBinary`,
`ByteArrayBackedBinary`, `ByteBufferBackedBinary`).
- `testHashCodeNotCachedForReusedBinary`: reused Binary returns the new
hash after the backing buffer is replaced.
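The gating and both tested behaviors can be sketched together in a standalone toy (this is not the Parquet code; `reused` stands in for `isBackingBytesReused`, and the backing array is deliberately shared in reused mode):

```java
// Toy model of the gated hash cache; names mirror but are not the Parquet internals.
final class ToyBinary {
    private final byte[] backing;
    private final boolean reused;   // stands in for isBackingBytesReused
    transient int cachedHashCode;   // 0 == not yet computed

    ToyBinary(byte[] backing, boolean reused) {
        this.backing = backing;     // not copied: reused mode shares the caller's buffer
        this.reused = reused;
    }

    // Runs only on the cache-miss path; caches only for immutable-backing instances.
    void cacheHashCode(int h) {
        if (!reused) {
            cachedHashCode = h;
        }
    }

    @Override
    public int hashCode() {
        int h = cachedHashCode;
        if (h == 0) {               // miss path (reused instances always take it)
            for (byte b : backing) {
                h = 31 * h + b;
            }
            cacheHashCode(h);
        }
        return h;
    }
}
```

A constant-backed instance computes its hash once and serves it from the field thereafter; a reused-backed instance recomputes on every call, so mutating the shared buffer is immediately reflected in `hashCode()`.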
## Validation
- `parquet-column`: 575 tests pass (was 573; +2 new tests for the cache).
- Built with `-Dspotless.check.skip=true -Drat.skip=true
-Djapicmp.skip=true`.
## Related
This is the third in a small series of focused performance PRs from work in
https://github.com/iemejia/parquet-perf. Previous: #3494 (PlainValuesReader),
#3496 (PlainValuesWriter).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]