iemejia opened a new issue, #3499:
URL: https://github.com/apache/parquet-java/issues/3499
### Describe the enhancement requested
PLAIN_DICTIONARY encoding of `BINARY` columns repeatedly hashes `Binary`
keys during dictionary map lookups (`Object2IntMap.put` / `getInt` calls inside
`DictionaryValuesWriter.PlainBinaryDictionaryValuesWriter`). Each call to
`Binary.hashCode()` recomputes the FNV-style hash byte-by-byte across the full
backing buffer:
```java
private static final int hashCode(byte[] array, int offset, int length) {
int result = 1;
for (int i = offset; i < offset + length; i++) {
byte b = array[i];
result = 31 * result + b;
}
return result;
}
```
For columns with many repeated values, the same `Binary` instance is hashed
many times — once per dictionary lookup attempt across the entire page. As
string length grows, the cost is dominated by these recomputations.
JMH (`BinaryEncodingBenchmark.encodeDictionary`, 100k values per invocation,
JDK 18, JMH -wi 5 -i 10 -f 3) on `master`:
| cardinality | stringLength | ops/s |
|-------------|--------------:|--------:|
| LOW | 10 | 13.2M |
| LOW | 100 | 3.0M |
| LOW | 1000 | 300K |
| HIGH | 10 | 848K |
| HIGH | 100 | 418K |
| HIGH | 1000 | 72.5K |
The 1000-byte LOW case spending 3 ms to encode 100k values is essentially
all hashing work.
### Proposal
Cache `hashCode()` per `Binary` instance using the
`java.lang.String.hashCode()` idiom — a single `int` field with sentinel `0`
meaning "not yet computed":
```java
@Override
public int hashCode() {
int h = cachedHashCode;
if (h != 0) {
return h;
}
return cacheHashCode(Binary.hashCode(value, offset, length));
}
```
Properties:
- **Race-safe without `volatile`**: the value computed is a deterministic
function of immutable bytes, so any two threads that race on the first call
produce the same `int` and either ordering is correct (per `java.lang.String`).
- **Reused (mutable-buffer) Binary instances do not cache**, preserving the
current contract that mutating the backing array between calls produces a
different hash.
- **Hash that genuinely equals 0 is recomputed every call** — acceptably
rare and still semantically correct.
Expected speedup (JMH same configuration, 30 samples per row):
| cardinality | stringLength | Before | After | Δ |
|-------------|--------------:|-------:|------:|---:|
| LOW | 10 | 13.2M | 20.2M | **+53%** |
| LOW | 100 | 3.0M | 18.0M | **+511%** |
| LOW | 1000 | 300K | 21.9M | **+7193% (72.9x)** |
| HIGH | 10 | 848K | 1.34M | **+58%** |
| HIGH | 100 | 418K | 1.32M | **+216%** |
| HIGH | 1000 | 72.5K | 1.30M | **+1687% (17.9x)** |
### Scope
- Single file change to
`parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java`.
- Adds two unit tests in `TestBinary` for cache correctness (constant
cached, reused not cached).
- No public-API change. Adds one package-private field (`transient int
cachedHashCode`) and one package-private helper (`cacheHashCode(int)`).
- No allocation-rate change (cache is one extra `int` field on existing
instances).
### Relation
This is part of a small series of focused PRs upstreaming optimizations from
work in [parquet-perf](https://github.com/iemejia/parquet-perf). Previous PRs
in the series: #3494 (PlainValuesReader), #3496 (PlainValuesWriter).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]