iemejia opened a new pull request, #3500:
URL: https://github.com/apache/parquet-java/pull/3500

   ## Summary
   
   Closes #3499.
   
   Caches `Binary.hashCode()` per instance for non-reused (immutable-backing) 
`Binary` values. Eliminates repeated full-buffer hash recomputation during 
`PLAIN_DICTIONARY` encoding, where the same key is hashed many times across a 
100k-value page. Reused (mutable-backing) instances skip the cache to preserve 
their existing semantics.
   
   Uses the `java.lang.String.hashCode()` idiom — a single `int` field with 
sentinel `0` meaning "not yet computed" — so the cache is **race-safe without 
`volatile`** (concurrent first calls compute the same deterministic value; 
either ordering is correct).
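   The idiom can be sketched as follows. `CachedHash` is an illustrative stand-in for `Binary` (names here are hypothetical, not Parquet's); the point is the sentinel-`0` field and the benign race: any thread that loses the race recomputes the same deterministic value.

```java
// Minimal sketch of the java.lang.String.hashCode() idiom described above.
// CachedHash is an illustrative stand-in for Binary, not Parquet's code.
final class CachedHash {
    private final byte[] bytes;  // immutable backing data
    private int cachedHashCode;  // 0 = "not yet computed" sentinel

    CachedHash(byte[] bytes) {
        this.bytes = bytes.clone();
    }

    @Override
    public int hashCode() {
        int h = cachedHashCode;
        if (h == 0) { // cache miss (or the hash genuinely is 0, which just recomputes)
            for (byte b : bytes) {
                h = 31 * h + b; // deterministic, so racing threads compute the same value
            }
            cachedHashCode = h; // benign race: plain int write, no volatile needed
        }
        return h;
    }
}
```

   A single `int` is written atomically in Java, so concurrent first callers can at worst duplicate work, never observe a torn or wrong value.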
   
   ## Benchmark
   
   `BinaryEncodingBenchmark.encodeDictionary`, 100k BINARY values per 
invocation, JDK 18, JMH `-wi 5 -i 10 -f 3` (30 samples per row):
   
   | cardinality | stringLength | Before (ops/s) | After (ops/s) | Improvement |
   |-------------|-------------:|---------------:|--------------:|------------:|
   | LOW         |           10 |     13,170,110 |     20,203,480 | **+53%** |
   | LOW         |          100 |      2,955,460 |     18,048,610 | **+511%** |
   | LOW         |         1000 |        300,693 |     21,933,470 | **+7193% (72.9x)** |
   | HIGH        |           10 |        847,657 |      1,336,238 | **+58%** |
   | HIGH        |          100 |        418,327 |      1,323,284 | **+216%** |
   | HIGH        |         1000 |         72,553 |      1,296,679 | **+1687% (17.9x)** |
   
   The relative gain grows with string length (per-call hash work is O(N), 
cache lookup is O(1)) and with low cardinality (each unique key is hashed many 
more times).
   
   **Negative control**: `encodePlain` (writes Binary without dictionary lookups, so it never calls `hashCode`) is unchanged within ±2.5% across all parameter combinations. **Allocation rate per op** (`gc.alloc.rate.norm`) is identical between baseline and optimized, so the speedup is pure CPU saved.
   
   ## Implementation notes
   
   - Single field added: `transient int cachedHashCode` on `Binary`, package-private so the three nested subclasses can read it directly on the hot path (a private field on `Binary` would only be reachable from the nested subclasses through a method-call indirection).
   - The cached value is gated on `!isBackingBytesReused` inside a small 
package-private `cacheHashCode(int)` helper that runs only on the cache-miss 
path.
   - 2 new tests in `TestBinary`:
     - `testHashCodeCachedForConstantBinary`: constant Binary returns stable 
`hashCode`, equal across the three impls (`ByteArraySliceBackedBinary`, 
`ByteArrayBackedBinary`, `ByteBufferBackedBinary`).
     - `testHashCodeNotCachedForReusedBinary`: reused Binary returns the new 
hash after the backing buffer is replaced.
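   The gating described above can be sketched like this. `BinaryLike`, its `setBytes` method, and the use of `Arrays.hashCode` are illustrative stand-ins under stated assumptions, not Parquet's actual implementation; only the `cacheHashCode(int)` helper and the reused-backing gate come from this PR's description.

```java
// Hedged sketch: cache the hash only for instances whose backing bytes are
// never swapped out. Names other than cacheHashCode are illustrative.
final class BinaryLike {
    private byte[] bytes;                     // may be swapped when reused
    private final boolean backingBytesReused; // stand-in for isBackingBytesReused
    int cachedHashCode;                       // package-private, 0 = unset

    BinaryLike(byte[] bytes, boolean reused) {
        this.bytes = bytes;
        this.backingBytesReused = reused;
    }

    // Only meaningful for reused instances, whose buffer may be replaced.
    void setBytes(byte[] newBytes) {
        this.bytes = newBytes;
    }

    // Runs only on the cache-miss path; reused instances skip the cache.
    void cacheHashCode(int hash) {
        if (!backingBytesReused) {
            cachedHashCode = hash;
        }
    }

    @Override
    public int hashCode() {
        int h = cachedHashCode;
        if (h == 0) { // reused instances always land here and recompute
            h = java.util.Arrays.hashCode(bytes);
            cacheHashCode(h);
        }
        return h;
    }
}
```

   A constant instance computes its hash once and then serves it from the field; a reused instance recomputes on every call, so replacing the buffer yields the new hash, matching the behavior the two tests assert.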
   
   ## Validation
   
   - `parquet-column`: 575 tests pass (was 573; +2 new tests for the cache).
   - Built with `-Dspotless.check.skip=true -Drat.skip=true 
-Djapicmp.skip=true`.
   
   ## Related
   
   This is the third in a small series of focused performance PRs from work in 
https://github.com/iemejia/parquet-perf. Previous: #3494 (PlainValuesReader), 
#3496 (PlainValuesWriter).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

