iemejia opened a new pull request, #3512:
URL: https://github.com/apache/parquet-java/pull/3512
## Summary
Resolves #3511.
The `parquet-benchmarks` shaded jar built from current master is
**non-functional**: it fails at runtime with `RuntimeException: Unable to find
the resource: /META-INF/BenchmarkList`. This PR fixes that and adds 11 new
files (nine JMH benchmark classes plus two support classes) covering the
encode/decode paths exercised by the open performance PRs, so reviewers can
reproduce the reported numbers.
## What's broken on master
`parquet-benchmarks/pom.xml` is missing two pieces of configuration:
- `maven-compiler-plugin` lacks the `annotationProcessorPaths` /
`annotationProcessors` config for `jmh-generator-annprocess`, so the JMH
annotation processor never runs and `META-INF/BenchmarkList` /
`META-INF/CompilerHints` are never generated.
- `maven-shade-plugin` lacks `AppendingTransformer` entries for those two
resources, so even if they were generated, they would be dropped during
shading.
Both problems are fixed in this PR.
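A sketch of the two missing configuration blocks (plugin layout and the `${jmh.version}` property are illustrative, not copied verbatim from the PR):

```xml
<!-- maven-compiler-plugin: run the JMH annotation processor so that
     META-INF/BenchmarkList and META-INF/CompilerHints get generated -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <annotationProcessorPaths>
      <path>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-generator-annprocess</artifactId>
        <version>${jmh.version}</version>
      </path>
    </annotationProcessorPaths>
  </configuration>
</plugin>

<!-- maven-shade-plugin: merge (rather than drop) the generated JMH
     resources when building the shaded jar -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>META-INF/BenchmarkList</resource>
      </transformer>
      <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>META-INF/CompilerHints</resource>
      </transformer>
    </transformers>
  </configuration>
</plugin>
```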
## Benchmarks added
11 new files in
`parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/`:
| Benchmark | Coverage |
|---|---|
| `IntEncodingBenchmark` | int encode/decode: PLAIN, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, RLE, DICTIONARY |
| `BinaryEncodingBenchmark` | Binary write/read paths, parameterized on length and cardinality |
| `ByteStreamSplitEncodingBenchmark` / `ByteStreamSplitDecodingBenchmark` | BSS for float / double / int / long |
| `FixedLenByteArrayEncodingBenchmark` | FLBA encode/decode |
| `FileReadBenchmark` / `FileWriteBenchmark` | CPU-focused file-level benchmarks |
| `RowGroupFlushBenchmark` | Flush path |
| `ConcurrentReadWriteBenchmark` | Multi-threaded read/write throughput |
| `BlackHoleOutputFile` | `OutputFile` that discards bytes, isolating CPU from I/O |
| `TestDataFactory` | Shared data-generation utilities |
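The `BlackHoleOutputFile` above implements Parquet's `OutputFile` interface, which is too large to reproduce here. A minimal stand-alone sketch of the same idea (a hypothetical `BlackHoleStream`, not part of the PR) is an `OutputStream` that counts bytes but stores nothing, so write benchmarks measure encoding CPU rather than disk I/O:

```java
import java.io.OutputStream;

/**
 * Illustrative stand-in for the BlackHoleOutputFile idea: every byte is
 * counted and then discarded, so no filesystem work happens on the
 * benchmarked path.
 */
public class BlackHoleStream extends OutputStream {
    private long bytesWritten = 0;

    @Override
    public void write(int b) {
        bytesWritten++; // discard the byte, keep the count
    }

    @Override
    public void write(byte[] b, int off, int len) {
        bytesWritten += len; // bulk writes are counted, not stored
    }

    public long getBytesWritten() {
        return bytesWritten;
    }
}
```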
## Validation
After this PR, the shaded jar is runnable and registers 87 benchmarks:
```
$ ./mvnw clean package -pl parquet-benchmarks -DskipTests \
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
$ java -jar parquet-benchmarks/target/parquet-benchmarks.jar -l | wc -l
87
```
Sanity check — `IntEncodingBenchmark.decodePlain` reproduces the master
baseline cited in #3493/#3494 (~91M ops/s on JDK 21, JMH 1.37, 3 warmup + 5
measurement iterations):
```
Benchmark                          (dataPattern)      Mode  Cnt         Score         Error  Units
IntEncodingBenchmark.decodePlain   SEQUENTIAL        thrpt    5  93528419.575 ± 1472148.214  ops/s
IntEncodingBenchmark.decodePlain   RANDOM            thrpt    5  90908523.483 ± 1978982.394  ops/s
IntEncodingBenchmark.decodePlain   LOW_CARDINALITY   thrpt    5  92672978.255 ± 2071927.851  ops/s
IntEncodingBenchmark.decodePlain   HIGH_CARDINALITY  thrpt    5  90770177.655 ± 2427904.955  ops/s
```
## Out of scope (deferred)
Modernization of the existing `ReadBenchmarks` / `WriteBenchmarks` /
`NestedNullWritingBenchmarks` (Hadoop-free `LocalInputFile`, parameterization,
JMH-idiomatic state setup) is a separate concern and will be proposed in a
follow-up PR.
## Follow-up
Once this lands, each open perf PR (#3494, #3496, #3500, #3504, #3506,
#3510) will be updated with a one-line "How to reproduce" snippet referencing
the relevant `*Benchmark` class.
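For illustration, such a reproduce snippet might look like the following; the benchmark name and `dataPattern` parameter come from the table above, while `-f`/`-wi`/`-i` are standard JMH CLI flags, with counts chosen here to match the 3 warmup + 5 measurement iterations used in the validation run:

```
java -jar parquet-benchmarks/target/parquet-benchmarks.jar \
  "IntEncodingBenchmark.decodePlain" -f 1 -wi 3 -i 5 -p dataPattern=RANDOM
```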
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]