cmcarthur opened a new issue, #1453:
URL: https://github.com/apache/iceberg-rust/issues/1453

   ### What's the feature are you trying to implement?
   
   ## Rationale
   
   Some organizations want to use Iceberg as their data lake but do not want to
   run Spark alongside every catalog deployment. Rust seems like a good target for
   "single-node" table maintenance operations.
   
   This issue lays out a plan for implementing some of the standard table
   maintenance tasks defined in the Iceberg docs:
   https://iceberg.apache.org/docs/1.9.1/maintenance/
   
   ## Design Principles
   
   1. Follow Spark's conventions. Where possible, match the API and implementation
   conventions established by the existing Spark maintenance operations.
   2. Incrementalize work where possible. Each operation will run on a single node.
   Since single-node memory and disk are limited, the Rust implementation will
   "incrementalize" work by breaking operations down into smaller chunks that can
   be committed or completed independently. For example, the Spark implementation
   of "DeleteOrphanFiles" first gathers all files to be deleted and only then
   deletes them concurrently; the two steps run sequentially. On a single node,
   for large tables, this operation may fail due to insufficient memory,
   potentially after running for a long time gathering files. The Rust
   implementation of the same maintenance operation will provide options to delete
   files as they are identified rather than at the end of the job. There is some
   precedent for this in the `partial-progress.max-commits` configuration option
   of the "RewriteDataFiles" operation.
   3. Develop a low-level API that can be compiled into a binary. Unlike Spark,
   where FileIOs can be set up entirely through configuration options, this will
   provide lower-level primitives in the form of traits. Configuring FileIOs via,
   for example, a configuration file is a separate concern.
   4. Allow for extensibility: use traits with `dyn X` arguments to ensure that 
these operations
   work with current and future FileIOs.
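
To make principles 2 and 4 concrete, here is a minimal sketch of incremental orphan-file deletion behind a trait object. Everything named here (`FileDeleter`, `RecordingDeleter`, `remove_orphans_incrementally`, `batch_size`) is hypothetical and is not part of the iceberg-rust API. The sketch assumes the set of referenced files fits in memory and omits safeguards such as an age cutoff; it only illustrates deleting files as they are identified rather than after the full listing has been gathered.

```rust
use std::cell::RefCell;
use std::collections::HashSet;

/// Hypothetical storage abstraction (NOT the iceberg-rust `FileIO` API).
/// Taking it as `&dyn FileDeleter` lets the operation work with any backend.
pub trait FileDeleter {
    fn delete(&self, path: &str) -> Result<(), String>;
}

/// In-memory stand-in that only records which paths were "deleted".
#[derive(Default)]
pub struct RecordingDeleter {
    pub deleted: RefCell<Vec<String>>,
}

impl FileDeleter for RecordingDeleter {
    fn delete(&self, path: &str) -> Result<(), String> {
        self.deleted.borrow_mut().push(path.to_string());
        Ok(())
    }
}

/// Streams over the listed files and deletes orphans in batches of
/// `batch_size`, so at most one batch of candidates is buffered at a time.
/// Returns the number of files deleted.
pub fn remove_orphans_incrementally<I>(
    listed: I,
    referenced: &HashSet<String>,
    deleter: &dyn FileDeleter,
    batch_size: usize,
) -> Result<usize, String>
where
    I: IntoIterator<Item = String>,
{
    let batch_size = batch_size.max(1); // guard against a zero batch size
    let mut batch: Vec<String> = Vec::with_capacity(batch_size);
    let mut deleted = 0;

    for path in listed {
        if referenced.contains(&path) {
            continue; // still referenced by table metadata, keep it
        }
        batch.push(path);
        if batch.len() == batch_size {
            deleted += flush(&mut batch, deleter)?;
        }
    }
    // Delete whatever is left in the final, possibly partial, batch.
    deleted += flush(&mut batch, deleter)?;
    Ok(deleted)
}

/// Deletes every path in the batch and empties it.
fn flush(batch: &mut Vec<String>, deleter: &dyn FileDeleter) -> Result<usize, String> {
    let n = batch.len();
    for path in batch.drain(..) {
        deleter.delete(&path)?;
    }
    Ok(n)
}
```

Because each batch is deleted before the next one is collected, a failure partway through still leaves earlier deletions done, which is the same trade-off partial-progress commits make.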
   
   ## Implementation
   
   This initial implementation will focus on three maintenance tasks:
   
   1. Expire Snapshots
   2. Rewrite Manifests
   3. Remove Orphan Files
   
   Other operations are well-suited to re-implementation in Rust, but these are 
(in my view) the
   critical baseline operations that must run to keep the Iceberg metadata in a 
healthy state.
   
   I will open separate issues for each operation and attach PRs.
   
   ### Willingness to contribute
   
   I can contribute to this feature independently


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

