iemejia opened a new issue, #3499:
URL: https://github.com/apache/parquet-java/issues/3499

   ### Describe the enhancement requested
   
   PLAIN_DICTIONARY encoding of `BINARY` columns repeatedly hashes `Binary` 
keys during dictionary map lookups (`Object2IntMap.put` / `getInt` calls inside 
`DictionaryValuesWriter.PlainBinaryDictionaryValuesWriter`). Each call to 
`Binary.hashCode()` recomputes the FNV-style hash byte-by-byte across the full 
backing buffer:
   
   ```java
   private static final int hashCode(byte[] array, int offset, int length) {
     int result = 1;
     for (int i = offset; i < offset + length; i++) {
       byte b = array[i];
       result = 31 * result + b;
     }
     return result;
   }
   ```
   
   For columns with many repeated values, the same `Binary` instance is hashed 
many times — once per dictionary lookup attempt across the entire page. As 
string length grows, the cost is dominated by these recomputations.
   
   JMH (`BinaryEncodingBenchmark.encodeDictionary`, 100k values per invocation, 
JDK 18, JMH -wi 5 -i 10 -f 3) on `master`:
   
   | cardinality | stringLength | ops/s |
   |-------------|--------------:|--------:|
   | LOW         |  10           |  13.2M |
   | LOW         | 100           |   3.0M |
   | LOW         | 1000          | 300K |
   | HIGH        |  10           | 848K |
   | HIGH        | 100           | 418K |
   | HIGH        | 1000          |  72.5K |
   
   The 1000-byte LOW case spending 3 ms to encode 100k values is essentially 
all hashing work.
   
   ### Proposal
   
   Cache `hashCode()` per `Binary` instance using the 
`java.lang.String.hashCode()` idiom — a single `int` field with sentinel `0` 
meaning "not yet computed":
   
   ```java
   @Override
   public int hashCode() {
     int h = cachedHashCode;
     if (h != 0) {
       return h;
     }
     return cacheHashCode(Binary.hashCode(value, offset, length));
   }
   ```
   
   Properties:
   
   - **Race-safe without `volatile`**: the value computed is a deterministic 
function of immutable bytes, so any two threads that race on the first call 
produce the same `int` and either ordering is correct (per `java.lang.String`).
   - **Reused (mutable-buffer) Binary instances do not cache**, preserving the 
current contract that mutating the backing array between calls produces a 
different hash.
   - **Hash that genuinely equals 0 is recomputed every call** — acceptably 
rare and still semantically correct.
   
   Expected speedup (JMH same configuration, 30 samples per row):
   
   | cardinality | stringLength | Before | After | Δ |
   |-------------|--------------:|-------:|------:|---:|
   | LOW         |  10           | 13.2M | 20.2M | **+53%** |
   | LOW         | 100           |  3.0M | 18.0M | **+511%** |
   | LOW         | 1000          | 300K  | 21.9M | **+7193% (72.9x)** |
   | HIGH        |  10           | 848K  | 1.34M | **+58%** |
   | HIGH        | 100           | 418K  | 1.32M | **+216%** |
   | HIGH        | 1000          | 72.5K | 1.30M | **+1687% (17.9x)** |
   
   ### Scope
   
   - Single file change to 
`parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java`.
   - Adds two unit tests in `TestBinary` for cache correctness (constant 
cached, reused not cached).
   - No public-API change. Adds one package-private field (`transient int 
cachedHashCode`) and one package-private helper (`cacheHashCode(int)`).
   - No allocation-rate change (cache is one extra `int` field on existing 
instances).
   
   ### Relation
   
   This is part of a small series of focused PRs upstreaming optimizations from 
work in [parquet-perf](https://github.com/iemejia/parquet-perf). Previous PRs 
in the series: #3494 (PlainValuesReader), #3496 (PlainValuesWriter).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to