jerryshao opened a new pull request, #9588:
URL: https://github.com/apache/gravitino/pull/9588
This commit implements a built-in job template for rewriting Iceberg table
data files, which supports binpack, sort, and z-order strategies for table
optimization.
Key Features:
- Named argument parser supporting flexible parameter combinations
- Calls Iceberg's native rewrite_data_files stored procedure
- Supports all rewrite strategies: binpack, sort, z-order
- Configurable options for file sizes, thresholds, and behavior
- Template-based configuration for Spark and Iceberg catalogs
- Handles the procedure's result set in both Iceberg 1.6.1 (4 columns) and newer versions (5 columns)
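
One way to tolerate both result widths is to normalize each returned row to a fixed length before reading it. The sketch below is only an illustration: `RewriteResultSketch` and `normalize` are hypothetical names, a plain `Object[]` stands in for a Spark row, and the description does not say which column the newer versions add, so the missing trailing column is simply filled with 0.

```java
/** Hypothetical sketch: normalize a 4- or 5-column result row to 5 values. */
final class RewriteResultSketch {

  private RewriteResultSketch() {}

  /**
   * Copies the numeric columns of a result row into a 5-element array.
   * A 4-column row (as described for Iceberg 1.6.1) gets 0 in the last slot.
   * Assumption: every column is numeric; null values are treated as 0.
   */
  static long[] normalize(Object[] row) {
    if (row.length < 4 || row.length > 5) {
      throw new IllegalArgumentException(
          "Expected 4 or 5 result columns, got " + row.length);
    }
    long[] values = new long[5];
    for (int i = 0; i < values.length; i++) {
      boolean present = i < row.length && row[i] != null;
      values[i] = present ? ((Number) row[i]).longValue() : 0L;
    }
    return values;
  }
}
```

Callers can then read a fixed set of indexes regardless of which Iceberg version produced the row.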
Implementation:
- IcebergRewriteDataFilesJob.java (335 lines)
  - Template name: builtin-iceberg-rewrite-data-files
  - Version: v1
  - Arguments: --catalog, --table, --strategy, --sort-order, --where, --options
  - Spark configs for runtime and Iceberg catalog setup
- BuiltInJobTemplateProvider.java (modified)
  - Registered new IcebergRewriteDataFilesJob
- build.gradle.kts (modified)
  - Added Iceberg Spark runtime dependency (1.6.1)
  - Added Spark, Scala, and Hadoop test dependencies
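
For readers unfamiliar with the named-argument style, the following is a minimal sketch of an order-independent `--name value` parser. It is not the code in IcebergRewriteDataFilesJob.java; the class `NamedArgsSketch` and its methods are hypothetical, and it assumes that no argument value itself begins with `--`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of an order-independent "--name value" parser. */
final class NamedArgsSketch {

  private NamedArgsSketch() {}

  /**
   * Parses pairs such as "--catalog iceberg_prod" into a map keyed by the
   * flag name without its leading dashes. A flag followed by another flag,
   * or by nothing, is recorded with an empty value.
   */
  static Map<String, String> parse(String[] args) {
    Map<String, String> parsed = new LinkedHashMap<>();
    int i = 0;
    while (i < args.length) {
      String token = args[i];
      if (!token.startsWith("--") || token.length() == 2) {
        throw new IllegalArgumentException("Expected --name, got: " + token);
      }
      String name = token.substring(2);
      // Assumption: values never start with "--", so that prefix marks a flag.
      boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("--");
      parsed.put(name, hasValue ? args[i + 1] : "");
      i += hasValue ? 2 : 1;
    }
    return parsed;
  }

  /** Returns the value of a required flag, or fails with a clear message. */
  static String require(Map<String, String> parsed, String name) {
    String value = parsed.get(name);
    if (value == null || value.isEmpty()) {
      throw new IllegalArgumentException("Missing required argument: --" + name);
    }
    return value;
  }
}
```

Because the result is a map, `--table db.sample --catalog iceberg_prod` and `--catalog iceberg_prod --table db.sample` parse identically, and optional flags that are absent are simply missing from the map.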
Tests (41 tests, all passing):
- TestIcebergRewriteDataFilesJob.java (33 tests, 429 lines)
  - Template structure validation
  - Argument parsing (required, optional, empty values, order-independent)
  - JSON options parsing (single, multiple, boolean, empty)
  - SQL generation (minimal, with strategy, sort, where, options, all params)
- TestIcebergRewriteDataFilesJobWithSpark.java (8 tests, 229 lines)
  - Real Spark session integration tests
  - Executes actual Iceberg rewrite_data_files procedures
  - Validates data integrity after rewrite operations
  - Tests all parameter combinations with live Iceberg catalog
Usage Examples:
--catalog iceberg_prod --table db.sample
--catalog iceberg_prod --table db.sample --strategy sort \
--sort-order 'id DESC NULLS LAST'
--catalog iceberg_prod --table db.sample --strategy sort \
--sort-order 'zorder(user_id, event_type, timestamp)'
--catalog iceberg_prod --table db.sample --where 'year = 2024' \
--options '{"min-input-files":"2","remove-dangling-deletes":"true"}'
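
To make the mapping from these arguments to the stored procedure concrete, here is a rough sketch of how a `CALL` statement could be assembled. It is an illustration, not the SQL generation in this PR: `RewriteSqlSketch`, `buildCall`, and `quote` are hypothetical names, the options are assumed to be already parsed from JSON into a map, and the backslash escaping assumes Spark's default string-literal parsing.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch: build a rewrite_data_files CALL from parsed arguments. */
final class RewriteSqlSketch {

  private RewriteSqlSketch() {}

  /** Wraps a value in single quotes, escaping backslashes and quotes. */
  static String quote(String value) {
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
  }

  private static boolean present(String value) {
    return value != null && !value.isEmpty();
  }

  /** Only the catalog and table are required; other parameters may be null. */
  static String buildCall(
      String catalog,
      String table,
      String strategy,
      String sortOrder,
      String where,
      Map<String, String> options) {
    StringJoiner params = new StringJoiner(", ");
    params.add("table => " + quote(table));
    if (present(strategy)) {
      params.add("strategy => " + quote(strategy));
    }
    if (present(sortOrder)) {
      params.add("sort_order => " + quote(sortOrder));
    }
    if (present(where)) {
      params.add("where => " + quote(where));
    }
    if (options != null && !options.isEmpty()) {
      // Options are passed as a Spark map('key', 'value', ...) literal.
      StringJoiner entries = new StringJoiner(", ", "map(", ")");
      for (Map.Entry<String, String> e : options.entrySet()) {
        entries.add(quote(e.getKey()));
        entries.add(quote(e.getValue()));
      }
      params.add("options => " + entries);
    }
    return "CALL " + catalog + ".system.rewrite_data_files(" + params + ")";
  }
}
```

Under these assumptions, the second usage example above would produce `CALL iceberg_prod.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'id DESC NULLS LAST')`.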
Fix: #9543