iemejia opened a new issue, #3511:
URL: https://github.com/apache/parquet-java/issues/3511
## Background
The 6 perf-optimization PRs currently open (#3494, #3496, #3500, #3504,
#3506, #3510) report headline numbers (12x decode speedup, 7193%
Binary.hashCode improvement, etc.) but cite JMH benchmarks that **do not exist
on master**. Reviewers cannot reproduce the numbers without manually copying
benchmark sources from elsewhere.
This issue tracks contributing the JMH benchmarks themselves so reviewers
can reproduce, validate, and continue measuring across future changes.
## Problems
### 1. `parquet-benchmarks` shaded jar is broken on master
A build of `parquet-benchmarks` from the current master produces a jar that
is **non-functional**:
```
$ java -jar parquet-benchmarks/target/parquet-benchmarks.jar
Exception in thread "main" java.lang.RuntimeException: ERROR: Unable to find
the resource: /META-INF/BenchmarkList
```
The `parquet-benchmarks/pom.xml` is missing two pieces of configuration:
- The `maven-compiler-plugin` lacks the `annotationProcessorPaths` /
`annotationProcessors` config for `jmh-generator-annprocess`. As a result the
JMH annotation processor never runs, and `META-INF/BenchmarkList` and
`META-INF/CompilerHints` are never generated. (Workaround: pass
`-Dmaven.compiler.proc=full`, but this is undiscoverable.)
- The `maven-shade-plugin` lacks `AppendingTransformer` entries for
`META-INF/BenchmarkList` and `META-INF/CompilerHints`. Even if the resources
were generated, shading would drop them.
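
One way to confirm whether a given jar has this problem is to look for the two resources directly. The following is a minimal sketch using only the JDK's `java.util.jar.JarFile`; the class name `JmhResourceCheck` is hypothetical and not part of parquet-java, and it checks only for the presence of the entries, not their content.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.jar.JarFile;

// Hypothetical helper, not part of parquet-java: reports whether a jar
// contains the JMH resources named in this issue.
public class JmhResourceCheck {
    static final String[] REQUIRED = {
        "META-INF/BenchmarkList",
        "META-INF/CompilerHints"
    };

    /** Maps each required resource name to true if the jar has an entry for it. */
    static Map<String, Boolean> check(String jarPath) throws IOException {
        Map<String, Boolean> result = new LinkedHashMap<>();
        try (JarFile jar = new JarFile(jarPath)) {
            for (String name : REQUIRED) {
                result.put(name, jar.getJarEntry(name) != null);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java JmhResourceCheck <path-to-jar>");
            return;
        }
        // Print one line per resource: "<name>: present" or "<name>: MISSING".
        for (Map.Entry<String, Boolean> e : check(args[0]).entrySet()) {
            System.out.println(e.getKey() + ": " + (e.getValue() ? "present" : "MISSING"));
        }
    }
}
```

Run against `parquet-benchmarks/target/parquet-benchmarks.jar`, it should report both entries as missing on a jar that shows the error above.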
### 2. No benchmarks for the optimizations under review
The 6 open perf PRs touch encode/decode paths in `parquet-column` and
`parquet-common` (PlainValuesReader/Writer, Binary.hashCode,
ByteStreamSplitValuesReader/Writer, BinaryPlainValuesReader). Master's
`parquet-benchmarks` covers only file-level read/write, not these CPU-bound
encoding paths.
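
To illustrate the kind of CPU-bound work these paths do, here is a minimal sketch of the BYTE_STREAM_SPLIT layout for `float` values, following the Parquet format's description (byte k of each little-endian value goes to stream k, and the streams are concatenated). It is a plain-JDK illustration, not the parquet-java `ByteStreamSplitValuesReader`/`Writer` implementation, and the class name `ByteStreamSplitSketch` is hypothetical.

```java
import java.util.Arrays;

// Illustrative only: a naive BYTE_STREAM_SPLIT encoder/decoder for floats.
public class ByteStreamSplitSketch {
    private static final int WIDTH = 4; // bytes per float

    /** Scatters byte k of every value into stream k; streams are laid out back to back. */
    static byte[] encode(float[] values) {
        int n = values.length;
        byte[] out = new byte[n * WIDTH];
        for (int i = 0; i < n; i++) {
            int bits = Float.floatToRawIntBits(values[i]);
            for (int k = 0; k < WIDTH; k++) {
                out[k * n + i] = (byte) (bits >>> (8 * k)); // little-endian byte k
            }
        }
        return out;
    }

    /** Gathers one byte from each stream to rebuild every value. */
    static float[] decode(byte[] encoded) {
        if (encoded.length % WIDTH != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + WIDTH);
        }
        int n = encoded.length / WIDTH;
        float[] values = new float[n];
        for (int i = 0; i < n; i++) {
            int bits = 0;
            for (int k = 0; k < WIDTH; k++) {
                bits |= (encoded[k * n + i] & 0xFF) << (8 * k);
            }
            values[i] = Float.intBitsToFloat(bits);
        }
        return values;
    }

    public static void main(String[] args) {
        float[] input = {1.0f, 2.0f, -3.5f};
        byte[] encoded = encode(input);
        System.out.println(Arrays.toString(encoded));
        System.out.println(Arrays.equals(input, decode(encoded))); // round trip
    }
}
```

Loops like these run once per value with no I/O, which is why a file-level benchmark does not isolate them.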
## Proposal
Land the following in a single PR against `parquet-benchmarks`:
1. **pom.xml fix**: add JMH annotation-processor config +
`AppendingTransformer` entries so the shaded jar is runnable.
2. **11 new files**: 9 JMH benchmarks covering the encoding/decoding paths
   under optimization, plus 2 supporting classes:
   - `IntEncodingBenchmark`: encode/decode with PLAIN, DELTA_BINARY_PACKED,
     BYTE_STREAM_SPLIT, RLE, and dictionary, parameterized on value count and
     data distribution
   - `BinaryEncodingBenchmark`: Binary write/read paths (PLAIN, dictionary),
     parameterized on length and cardinality
   - `ByteStreamSplitEncodingBenchmark`, `ByteStreamSplitDecodingBenchmark`:
     BSS encode/decode for float/double/int/long
   - `FixedLenByteArrayEncodingBenchmark`: FLBA encode/decode
   - `FileReadBenchmark`, `FileWriteBenchmark`: CPU-focused file-level
     benchmarks (minimal I/O via temp files)
   - `RowGroupFlushBenchmark`: flush-path benchmark
   - `ConcurrentReadWriteBenchmark`: multi-threaded read/write throughput
   - `BlackHoleOutputFile`: `OutputFile` that discards bytes, used to isolate
     CPU work from I/O
   - `TestDataFactory`: shared test-data generation utilities
After this lands, each existing perf PR will be amended with a one-line "How
to reproduce" snippet pointing at the relevant `*Benchmark` class.
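
The idea behind `BlackHoleOutputFile` can be sketched with the JDK alone. The class below is a hypothetical stand-in, not the proposed implementation: it assumes the sink must still report a byte position (Parquet's output stream abstraction exposes one so the writer can record offsets), so it counts bytes while discarding them. The real class would implement parquet-java's `OutputFile` interface instead of extending `java.io.OutputStream` directly.

```java
import java.io.OutputStream;

// Hypothetical JDK-only stand-in: discards every byte but tracks how many
// were written, so callers that record offsets still see a valid position.
public class DiscardingPositionStream extends OutputStream {
    private long position;

    @Override
    public void write(int b) {
        position++; // single byte discarded
    }

    @Override
    public void write(byte[] buf, int off, int len) {
        if (off < 0 || len < 0 || len > buf.length - off) {
            throw new IndexOutOfBoundsException();
        }
        position += len; // bulk write discarded without copying
    }

    /** Number of bytes "written" so far. */
    public long getPos() {
        return position;
    }

    public static void main(String[] args) {
        DiscardingPositionStream out = new DiscardingPositionStream();
        out.write(new byte[1024], 0, 1024);
        out.write(42);
        System.out.println(out.getPos());
    }
}
```

Because nothing is copied or flushed, time measured while writing to such a sink is dominated by encoding and compression rather than by disk or filesystem behavior.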
### Out of scope (deferred)
The existing `ReadBenchmarks`, `WriteBenchmarks`, and
`NestedNullWritingBenchmarks` could be modernized (Hadoop-free
`LocalInputFile`, parameterized over compression and writer version,
JMH-idiomatic state setup). That is a separate concern and will be proposed in
a follow-up PR.
## Validation
With the proposed pom changes, the shaded jar contains a populated
`META-INF/BenchmarkList` (87 benchmarks registered) and runs cleanly. As a
sanity check, `IntEncodingBenchmark.decodePlain` reproduces the ~91M ops/s
baseline cited in #3493/#3494 (master, JDK 21, JMH 1.37, 3 warmup + 5
measurement iterations).