cmcarthur opened a new issue, #1453: URL: https://github.com/apache/iceberg-rust/issues/1453
### What's the feature are you trying to implement?

## Rationale

Some organizations want to use Iceberg as their data lake but do not want to run Spark alongside every catalog deployment. Rust seems like a good target for "single-node" table maintenance operations. This issue lays out a plan for implementing some of the standard table maintenance tasks defined in the Iceberg docs: https://iceberg.apache.org/docs/1.9.1/maintenance/

## Design Principles

1. Follow the API and implementation conventions set by the Spark operations wherever possible.
2. Incrementalize work where possible. Each operation will run on a single node. Since single-node memory and disk are limited, the Rust implementation will "incrementalize" work by breaking operations down into smaller chunks that can each be committed or completed on their own. For example, the Spark implementation of "DeleteOrphanFiles" first gathers all files to be deleted and only then, once every file has been gathered, deletes them concurrently. These two steps run sequentially. On a single node, for large tables, this operation may fail for lack of memory, potentially after running for a long time while gathering files. The Rust implementation of the same maintenance operation will provide options to delete files as they are identified rather than at the end of the job. There is some precedent for this in the `partial-progress.max-commits` configuration option of the "RewriteDataFiles" operation.
3. Develop a low-level API that can be compiled into a binary. Unlike Spark, where FileIOs can be set up entirely through configuration options, this will provide lower-level primitives in the form of traits. Configuring FileIOs via, for example, a configuration file is a separate concern.
4. Allow for extensibility: use traits with `dyn X` arguments to ensure that these operations work with current and future FileIOs.
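Principles 2 and 4 can be illustrated together with a minimal sketch. Everything named here is hypothetical and chosen for illustration only: `FileStore`, `MemoryStore`, `delete_orphans_incrementally`, and `batch_size` are not part of the iceberg-rust API, and the sketch assumes the set of referenced files fits in memory, which a real implementation may not be able to assume. Candidate paths are streamed, and orphans are deleted in bounded batches through a `dyn` trait object instead of being collected in full first.

```rust
use std::collections::HashSet;

/// Hypothetical storage abstraction (an assumption, not the iceberg-rust FileIO).
pub trait FileStore {
    fn delete(&mut self, path: &str) -> Result<(), String>;
}

/// In-memory store used only to make the sketch self-contained.
#[derive(Default)]
pub struct MemoryStore {
    pub files: HashSet<String>,
}

impl FileStore for MemoryStore {
    fn delete(&mut self, path: &str) -> Result<(), String> {
        if self.files.remove(path) {
            Ok(())
        } else {
            Err(format!("not found: {}", path))
        }
    }
}

/// Streams `candidates` and deletes every path not in `referenced`.
/// At most `batch_size` orphan paths are buffered before they are deleted,
/// so work completed so far survives a later failure.
/// Returns the number of files deleted.
pub fn delete_orphans_incrementally(
    store: &mut dyn FileStore,
    candidates: impl Iterator<Item = String>,
    referenced: &HashSet<String>,
    batch_size: usize,
) -> Result<usize, String> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut batch: Vec<String> = Vec::with_capacity(batch_size);
    let mut deleted = 0;
    for path in candidates {
        if referenced.contains(&path) {
            continue; // still referenced by table metadata, keep it
        }
        batch.push(path);
        if batch.len() == batch_size {
            deleted += flush(store, &mut batch)?;
        }
    }
    // Delete whatever remains in the final, possibly partial, batch.
    deleted += flush(store, &mut batch)?;
    Ok(deleted)
}

/// Deletes every path in `batch`, leaving the buffer empty for reuse.
fn flush(store: &mut dyn FileStore, batch: &mut Vec<String>) -> Result<usize, String> {
    let n = batch.len();
    for path in batch.drain(..) {
        store.delete(&path)?;
    }
    Ok(n)
}
```

Because the function takes `&mut dyn FileStore`, any backend that implements the trait can be passed in without recompiling the maintenance logic, and a smaller `batch_size` trades throughput for a lower memory ceiling.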
## Implementation

This initial implementation will focus on three maintenance tasks:

1. Expire Snapshots
2. Rewrite Manifests
3. Remove Orphan Files

Other operations are well-suited to re-implementation in Rust, but these are (in my view) the critical baseline operations that must run to keep the Iceberg metadata in a healthy state.

I will open separate issues for each operation and attach PRs.

### Willingness to contribute

I can contribute to this feature independently

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
