jerryshao opened a new pull request, #9588:
URL: https://github.com/apache/gravitino/pull/9588

       This commit implements a built-in job template for rewriting Iceberg table
       data files, which supports binpack, sort, and z-order strategies for table
       optimization.
   
       Key Features:
       - Named argument parser supporting flexible parameter combinations
       - Calls Iceberg's native rewrite_data_files stored procedure
       - Supports all rewrite strategies: binpack, sort, and z-order (z-order is
         requested as a sort strategy with a zorder(...) sort order)
       - Configurable options for file sizes, thresholds, and behavior
       - Template-based configuration for Spark and Iceberg catalogs
       - Handles the procedure's result set for both Iceberg 1.6.1 (4 columns)
         and newer versions (5 columns)
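
To make the "named argument parser" item above concrete, here is a minimal sketch of how order-independent `--name value` parsing could work. It is not the code from this PR: the class name `NamedArgsSketch`, the `parse` method, and the choice to treat an empty value as absent are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not taken from IcebergRewriteDataFilesJob.java.
final class NamedArgsSketch {

  // Collects "--name value" pairs into a map, regardless of their order.
  static Map<String, String> parse(String[] args) {
    Map<String, String> parsed = new LinkedHashMap<>();
    for (int i = 0; i < args.length; i++) {
      String flag = args[i];
      if (!flag.startsWith("--")) {
        throw new IllegalArgumentException("Expected a --name flag but got: " + flag);
      }
      if (i + 1 >= args.length || args[i + 1].startsWith("--")) {
        throw new IllegalArgumentException("Missing value for " + flag);
      }
      String value = args[++i];
      // Assumption: an empty value means "not provided" and is skipped.
      if (!value.isEmpty()) {
        parsed.put(flag.substring(2), value);
      }
    }
    return parsed;
  }

  public static void main(String[] args) {
    Map<String, String> parsed =
        parse(new String[] {"--table", "db.sample", "--catalog", "iceberg_prod"});
    System.out.println(parsed.get("catalog") + " / " + parsed.get("table"));
  }
}
```

Keeping the result in a map means required arguments (`catalog`, `table`) can be checked after parsing, and optional ones can simply be missing.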
   
       Implementation:
       - IcebergRewriteDataFilesJob.java (335 lines)
         - Template name: builtin-iceberg-rewrite-data-files
         - Version: v1
         - Arguments: --catalog, --table, --strategy, --sort-order, --where, --options
         - Spark configs for runtime and Iceberg catalog setup
   
       - BuiltInJobTemplateProvider.java (modified)
         - Registered new IcebergRewriteDataFilesJob
   
       - build.gradle.kts (modified)
         - Added Iceberg Spark runtime dependency (1.6.1)
         - Added Spark, Scala, and Hadoop test dependencies
   
       Tests (41 tests, all passing):
       - TestIcebergRewriteDataFilesJob.java (33 tests, 429 lines)
         - Template structure validation
         - Argument parsing (required, optional, empty values, order-independent)
         - JSON options parsing (single, multiple, boolean, empty)
         - SQL generation (minimal, with strategy, sort, where, options, all params)
   
       - TestIcebergRewriteDataFilesJobWithSpark.java (8 tests, 229 lines)
         - Real Spark session integration tests
         - Executes actual Iceberg rewrite_data_files procedures
         - Validates data integrity after rewrite operations
         - Tests all parameter combinations with live Iceberg catalog
   
       Usage Examples:
       --catalog iceberg_prod --table db.sample
   
       --catalog iceberg_prod --table db.sample --strategy sort \
         --sort-order 'id DESC NULLS LAST'
   
       --catalog iceberg_prod --table db.sample --strategy sort \
         --sort-order 'zorder(user_id, event_type, timestamp)'
   
       --catalog iceberg_prod --table db.sample --where 'year = 2024' \
         --options '{"min-input-files":"2","remove-dangling-deletes":"true"}'
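
For orientation, the arguments in the examples above end up in a Spark SQL `CALL` to Iceberg's `rewrite_data_files` procedure, which accepts the named parameters `table`, `strategy`, `sort_order`, `where`, and `options`. The following is a rough sketch of that translation, not the PR's actual SQL generation: `RewriteSqlSketch`, `buildCall`, and `quote` are hypothetical names, the options are assumed to be already parsed from JSON into a map, and backslash-escaping of quotes is an assumption about how string literals should be protected.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration only; not taken from IcebergRewriteDataFilesJob.java.
final class RewriteSqlSketch {

  // Assumption: backslash-escape backslashes and single quotes in SQL string literals.
  static String quote(String s) {
    return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'";
  }

  // Builds the CALL statement; null or empty optional arguments are left out.
  static String buildCall(
      String catalog,
      String table,
      String strategy,
      String sortOrder,
      String where,
      Map<String, String> options) {
    StringJoiner call =
        new StringJoiner(", ", "CALL " + catalog + ".system.rewrite_data_files(", ")");
    call.add("table => " + quote(table));
    if (strategy != null) call.add("strategy => " + quote(strategy));
    if (sortOrder != null) call.add("sort_order => " + quote(sortOrder));
    if (where != null) call.add("where => " + quote(where));
    if (options != null && !options.isEmpty()) {
      // Options are passed as a Spark map('key', 'value', ...) expression.
      StringJoiner kv = new StringJoiner(", ", "map(", ")");
      options.forEach((k, v) -> {
        kv.add(quote(k));
        kv.add(quote(v));
      });
      call.add("options => " + kv);
    }
    return call.toString();
  }

  public static void main(String[] args) {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("min-input-files", "2");
    options.put("remove-dangling-deletes", "true");
    System.out.println(
        buildCall("iceberg_prod", "db.sample", null, null, "year = 2024", options));
  }
}
```

With the inputs from the last usage example, this sketch produces `CALL iceberg_prod.system.rewrite_data_files(table => 'db.sample', where => 'year = 2024', options => map('min-input-files', '2', 'remove-dangling-deletes', 'true'))`.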
   
       Fix: #9543
   

