Shekharrajak opened a new pull request, #3519: URL: https://github.com/apache/datafusion-comet/pull/3519
## Which issue does this PR close?

Ref https://github.com/apache/datafusion-comet/issues/3371

## PR Description

## Rationale for this change

Iceberg table compaction using Spark's default `rewriteDataFiles()` action is slow due to Spark shuffle and task scheduling overhead. This PR adds native Rust-based compaction that uses DataFusion for direct Parquet read/write, achieving a **1.5-1.8x speedup** over Spark's default compaction.

## What changes are included in this PR?

- **Native Rust compaction**: DataFusion-based Parquet read/write via JNI (`iceberg_compaction_jni.rs`)
- **Scala integration**: `CometNativeCompaction` class that **executes the native scan + write via JNI** and **commits via the Iceberg Java API**
- **Configuration**: `spark.comet.iceberg.compaction.enabled` config option
- **Benchmark**: TPC-H based compaction benchmark comparing Spark vs. native performance

## How are these changes tested?

- Unit tests in `CometIcebergCompactionSuite` covering:
  - Non-partitioned table compaction
  - Partitioned table compaction (bucket, truncate, date partitions)
  - Data correctness verification after compaction
- TPC-H benchmark (`CometIcebergTPCCompactionBenchmark`) measuring performance on the lineitem, orders, and customer tables
- Manual testing with SF1 TPC-H data showing:
  - lineitem (6M rows): 7.2s → 4.4s (1.6x)
  - orders (1.5M rows): 1.5s → 0.9s (1.8x)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
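
For readers who want to check the quoted speedups, here is a minimal sketch that recomputes them from the rounded timings in the manual test results. The `speedup` helper and the `timings` dictionary are illustrative names, not part of the PR, and the inputs are the rounded seconds from the description rather than raw benchmark output.

```python
def speedup(baseline_s: float, native_s: float) -> float:
    """Return how many times faster the native run is than the baseline."""
    if baseline_s <= 0 or native_s <= 0:
        raise ValueError("timings must be positive")
    return baseline_s / native_s


# Rounded timings quoted in the PR description: (Spark seconds, native seconds).
timings = {
    "lineitem": (7.2, 4.4),
    "orders": (1.5, 0.9),
}

# Speedup per table, computed as Spark time divided by native time.
ratios = {
    table: speedup(spark_s, native_s)
    for table, (spark_s, native_s) in timings.items()
}

for table, ratio in ratios.items():
    print(f"{table}: {ratio:.2f}x")
```

From the rounded inputs this gives about 1.64x for lineitem and 1.67x for orders. The 1.8x quoted for orders presumably comes from unrounded measurements; that is an assumption, since the raw timings are not included in the description.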
