Shekharrajak opened a new pull request, #3519:
URL: https://github.com/apache/datafusion-comet/pull/3519

   ## Which issue does this PR close?
   
   
   Ref https://github.com/apache/datafusion-comet/issues/3371
   
   ## Rationale for this change
   
   Iceberg table compaction with Spark's default `rewriteDataFiles()` action 
is slow because of Spark shuffle and task scheduling overhead. This PR adds 
native Rust-based compaction that uses DataFusion to read and write Parquet 
directly, achieving a **1.5-1.8x speedup** over Spark's default compaction.
   
   ## What changes are included in this PR?
   
   - **Native Rust compaction**: DataFusion-based Parquet read/write via JNI 
(`iceberg_compaction_jni.rs`)
   - **Scala integration**: `CometNativeCompaction` class that runs the native 
scan and write via JNI, then commits the result via the Iceberg Java API
   - **Configuration**: `spark.comet.iceberg.compaction.enabled` config option
   - **Benchmark**: TPC-H based compaction benchmark comparing Spark vs Native 
performance
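   
   A minimal sketch of turning the new option on when building a session. Only 
the key `spark.comet.iceberg.compaction.enabled` comes from this PR; the app 
name, the `local[*]` master, and the choice to set it at session build time are 
assumptions for illustration. A real setup also needs the Comet plugin and an 
Iceberg catalog configured, which are omitted here.
   
   ```scala
   import org.apache.spark.sql.SparkSession

   object EnableNativeCompaction {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("comet-iceberg-compaction") // hypothetical app name
         .master("local[*]")                  // assumption: local test run
         // Config key introduced by this PR; "true" opts in to native compaction.
         .config("spark.comet.iceberg.compaction.enabled", "true")
         .getOrCreate()

       // Read the value back to confirm the session picked it up.
       println(spark.conf.get("spark.comet.iceberg.compaction.enabled"))

       spark.stop()
     }
   }
   ```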
   
   ## How are these changes tested?
   
   - Unit tests in `CometIcebergCompactionSuite` covering:
     - Non-partitioned table compaction
     - Partitioned table compaction (bucket, truncate, date partitions)
     - Data correctness verification after compaction
   - TPC-H benchmark (`CometIcebergTPCCompactionBenchmark`) measuring 
performance on the lineitem, orders, and customer tables
   - Manual testing with SF1 TPC-H data showing:
     - lineitem (6M rows): 7.2s → 4.4s (1.6x)
     - orders (1.5M rows): 1.5s → 0.9s (1.8x)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

