iemejia opened a new pull request, #3512:
URL: https://github.com/apache/parquet-java/pull/3512

   ## Summary
   
   Resolves #3511.
   
   The `parquet-benchmarks` shaded jar built from current master is **non-functional**: it fails at runtime with `RuntimeException: Unable to find the resource: /META-INF/BenchmarkList`. This PR fixes that and adds 11 files (nine JMH benchmark classes plus two shared helpers) covering the encode/decode paths exercised by the open performance PRs, so reviewers can reproduce the reported numbers.
   
   ## What's broken on master
   
   `parquet-benchmarks/pom.xml` is missing two pieces of configuration:
   
   - `maven-compiler-plugin` lacks the `annotationProcessorPaths` / 
`annotationProcessors` config for `jmh-generator-annprocess`, so the JMH 
annotation processor never runs and `META-INF/BenchmarkList` / 
`META-INF/CompilerHints` are never generated.
   - `maven-shade-plugin` lacks `AppendingTransformer` entries for those two 
resources, so even if generated they would be dropped during shading.
   
   Both problems are fixed in this PR.
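   
   For orientation, the two missing pieces typically look something like the fragment below in a JMH project. This is a sketch, not the PR's actual diff: the `${jmh.version}` property name and the surrounding plugin layout are assumptions, while `org.openjdk.jmh.generators.BenchmarkProcessor` is the processor class shipped in `jmh-generator-annprocess`.
   
   ```xml
   <!-- Sketch only: property names and plugin layout are assumed, not taken from the PR. -->
   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-compiler-plugin</artifactId>
     <configuration>
       <!-- Put the JMH annotation processor on the processor path... -->
       <annotationProcessorPaths>
         <path>
           <groupId>org.openjdk.jmh</groupId>
           <artifactId>jmh-generator-annprocess</artifactId>
           <version>${jmh.version}</version>
         </path>
       </annotationProcessorPaths>
       <!-- ...and run it explicitly so META-INF/BenchmarkList is generated. -->
       <annotationProcessors>
         <annotationProcessor>org.openjdk.jmh.generators.BenchmarkProcessor</annotationProcessor>
       </annotationProcessors>
     </configuration>
   </plugin>

   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <configuration>
       <transformers>
         <!-- Concatenate the JMH metadata files from all inputs instead of dropping them. -->
         <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
           <resource>META-INF/BenchmarkList</resource>
         </transformer>
         <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
           <resource>META-INF/CompilerHints</resource>
         </transformer>
       </transformers>
     </configuration>
   </plugin>
   ```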
   
   ## Benchmarks added
   
   11 new files in 
`parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/`:
   
   | File | Coverage |
   |---|---|
   | `IntEncodingBenchmark` | int encode/decode: PLAIN, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, RLE, DICTIONARY |
   | `BinaryEncodingBenchmark` | Binary write/read paths, parameterized on length and cardinality |
   | `ByteStreamSplitEncodingBenchmark` / `ByteStreamSplitDecodingBenchmark` | BSS for float / double / int / long |
   | `FixedLenByteArrayEncodingBenchmark` | FLBA encode/decode |
   | `FileReadBenchmark` / `FileWriteBenchmark` | CPU-focused file-level benchmarks |
   | `RowGroupFlushBenchmark` | Flush path |
   | `ConcurrentReadWriteBenchmark` | Multi-threaded read/write throughput |
   | `BlackHoleOutputFile` | `OutputFile` that discards bytes, isolating CPU from I/O |
   | `TestDataFactory` | Shared data-generation utilities |
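   
   The source of `BlackHoleOutputFile` is not reproduced in this description. As a rough illustration of the idea only, the stdlib-only sketch below shows a sink that drops every byte but still tracks the write position. The class name `DiscardingPositionStream` is hypothetical. In parquet-java, `OutputFile.create` returns a `PositionOutputStream` whose `getPos()` the writer uses to record offsets, so a discarding sink still has to count bytes; the PR's class presumably wraps something along these lines.
   
   ```java
   import java.io.IOException;
   import java.io.OutputStream;
   import java.util.Objects;

   /** Hypothetical stand-in for illustration; not the PR's BlackHoleOutputFile. */
   public class DiscardingPositionStream extends OutputStream {
       private long pos; // number of bytes "written" so far

       @Override
       public void write(int b) {
           pos++; // drop the byte, keep the count
       }

       @Override
       public void write(byte[] buf, int off, int len) {
           Objects.checkFromIndexSize(off, len, buf.length);
           pos += len; // no copy, no I/O
       }

       /** Mirrors the role of PositionOutputStream.getPos() in parquet-java. */
       public long getPos() {
           return pos;
       }

       public static void main(String[] args) throws IOException {
           try (DiscardingPositionStream out = new DiscardingPositionStream()) {
               out.write(new byte[4096]);
               out.write(7);
               System.out.println(out.getPos()); // prints 4097
           }
       }
   }
   ```
   
   Because nothing is buffered or flushed to disk, time measured against such a sink reflects encoding and serialization work rather than storage throughput.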
   
   ## Validation
   
   After this PR, the shaded jar is runnable and registers 87 benchmarks:
   
   ```
   $ ./mvnw clean package -pl parquet-benchmarks -DskipTests \
       -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
   $ java -jar parquet-benchmarks/target/parquet-benchmarks.jar -l | wc -l
   87
   ```
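   
   Independently of running JMH, the presence of the generated metadata can be checked by looking inside the jar. The snippet below is a small sketch using only `java.util.jar`; the class name `BenchmarkListCheck` and the default jar path are assumptions for illustration and are not part of this PR.
   
   ```java
   import java.io.IOException;
   import java.util.jar.JarEntry;
   import java.util.jar.JarFile;

   /** Hypothetical helper: reports whether a jar carries JMH's benchmark index. */
   public class BenchmarkListCheck {

       /** Returns true if the jar has a META-INF/BenchmarkList file entry. */
       static boolean hasBenchmarkList(String jarPath) throws IOException {
           try (JarFile jar = new JarFile(jarPath)) {
               JarEntry entry = jar.getJarEntry("META-INF/BenchmarkList");
               return entry != null && !entry.isDirectory();
           }
       }

       public static void main(String[] args) throws IOException {
           // Default path is assumed from the build command above.
           String path = args.length > 0
                   ? args[0]
                   : "parquet-benchmarks/target/parquet-benchmarks.jar";
           System.out.println(path + ": BenchmarkList present = " + hasBenchmarkList(path));
       }
   }
   ```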
   
   Sanity check — `IntEncodingBenchmark.decodePlain` reproduces the master 
baseline cited in #3493/#3494 (~91M ops/s on JDK 21, JMH 1.37, 3 warmup + 5 
measurement iterations):
   
   ```
   Benchmark                            (dataPattern)   Mode  Cnt         Score         Error  Units
   IntEncodingBenchmark.decodePlain        SEQUENTIAL  thrpt    5  93528419.575 ± 1472148.214  ops/s
   IntEncodingBenchmark.decodePlain            RANDOM  thrpt    5  90908523.483 ± 1978982.394  ops/s
   IntEncodingBenchmark.decodePlain   LOW_CARDINALITY  thrpt    5  92672978.255 ± 2071927.851  ops/s
   IntEncodingBenchmark.decodePlain  HIGH_CARDINALITY  thrpt    5  90770177.655 ± 2427904.955  ops/s
   ```
   
   ## Out of scope (deferred)
   
   Modernization of the existing `ReadBenchmarks` / `WriteBenchmarks` / 
`NestedNullWritingBenchmarks` (Hadoop-free `LocalInputFile`, parameterization, 
JMH-idiomatic state setup) is a separate concern and will be proposed in a 
follow-up PR.
   
   ## Follow-up
   
   Once this lands, each open perf PR (#3494, #3496, #3500, #3504, #3506, 
#3510) will be updated with a one-line "How to reproduce" snippet referencing 
the relevant `*Benchmark` class.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

