kszucs opened a new pull request, #2375:
URL: https://github.com/apache/iceberg-rust/pull/2375

   Adds two opt-in capabilities aimed at storing Iceberg tables on
   HuggingFace Hub with content-defined chunking for efficient deduplication.
   
   ## HuggingFace Hub storage backend
   
   New `opendal-hf` feature on `iceberg-storage-opendal` (off by default,
   included in `opendal-all`) that wires HuggingFace's OpenDAL service into
   `FileIO`. Paths use the form:
   
     hf://[<repo_type>/]<owner>/<repo>[@<revision>]/<path_in_repo>
   
   where `repo_type` is one of `models` (default), `datasets`, `spaces`, or
   `buckets` (XET-backed object storage). Configuration is passed via
   `FileIOBuilder` properties:
   
     - `hf.token`     — API token (required for private repos / writes)
     - `hf.endpoint`  — Hub endpoint, defaults to https://huggingface.co
     - `hf.revision`  — fallback revision when a path has no `@<revision>`
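
   To make the path grammar concrete, here is a minimal parsing sketch. It is
   not the PR's `HfUri` implementation: `HfPath` and `parse_hf_path` are
   hypothetical names, the revision is assumed to end at the next `/`, and
   percent-decoding and multi-segment refs such as `refs/pr/N` are left out.

```rust
/// Hypothetical parsed form of an `hf://` path (not the PR's `HfUri` type).
#[derive(Debug, PartialEq)]
struct HfPath {
    repo_type: String,
    repo_id: String, // "<owner>/<repo>"
    revision: Option<String>,
    path_in_repo: String,
}

/// Parses `hf://[<repo_type>/]<owner>/<repo>[@<revision>]/<path_in_repo>`.
/// `fallback_revision` plays the role of the `hf.revision` property.
/// Simplification: an owner literally named `models`, `datasets`, `spaces`
/// or `buckets` is always read as a repo type.
fn parse_hf_path(uri: &str, fallback_revision: Option<&str>) -> Option<HfPath> {
    let rest = uri.strip_prefix("hf://")?;

    // Optional leading repo type; `models` is the default.
    let (repo_type, rest) = match rest.split_once('/') {
        Some((first, tail))
            if matches!(first, "models" | "datasets" | "spaces" | "buckets") =>
        {
            (first, tail)
        }
        _ => ("models", rest),
    };

    let (owner, rest) = rest.split_once('/')?;
    // The path inside the repo may be empty (the URI names the repo root).
    let (repo_and_rev, path_in_repo) = rest.split_once('/').unwrap_or((rest, ""));
    let (repo, revision) = match repo_and_rev.split_once('@') {
        Some((repo, rev)) => (repo, Some(rev)),
        None => (repo_and_rev, None),
    };
    if owner.is_empty() || repo.is_empty() {
        return None;
    }

    Some(HfPath {
        repo_type: repo_type.to_string(),
        repo_id: format!("{owner}/{repo}"),
        // An explicit `@<revision>` wins over the configured fallback.
        revision: revision.or(fallback_revision).map(str::to_string),
        path_in_repo: path_in_repo.to_string(),
    })
}
```

   Under these assumptions, `hf://datasets/acme/tables@main/warehouse/t.json`
   parses to repo type `datasets`, repo id `acme/tables`, revision `main`, and
   path `warehouse/t.json`.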
   
   The `OpenDalResolvingStorage` recognises the `hf` scheme and lazily
   constructs a per-scheme storage instance. `delete_stream` groups paths
   by `<repo_type>/<repo_id>`, so paths that name the same repo id under
   different repo types (for example a bucket and a dataset) do not share
   an operator.
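
   The grouping step can be sketched as follows. This is an illustration
   under assumptions rather than the PR's code: `operator_key` and
   `group_by_repo` are hypothetical helpers, the revision is deliberately
   left out of the key, and paths that do not parse are skipped.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: derives the `<repo_type>/<repo_id>` grouping key
/// from an `hf://` path, ignoring any `@<revision>` suffix on the repo.
fn operator_key(path: &str) -> Option<String> {
    let rest = path.strip_prefix("hf://")?;
    let mut segments = rest.split('/');
    let first = segments.next()?;
    let (repo_type, owner) =
        if matches!(first, "models" | "datasets" | "spaces" | "buckets") {
            (first, segments.next()?)
        } else {
            ("models", first) // default repo type
        };
    let repo = segments.next()?.split('@').next()?;
    if owner.is_empty() || repo.is_empty() {
        return None;
    }
    Some(format!("{repo_type}/{owner}/{repo}"))
}

/// Buckets the paths so each group could be served by a single operator.
/// Unparseable paths are dropped here; real code would report an error.
fn group_by_repo<'a>(paths: &[&'a str]) -> BTreeMap<String, Vec<&'a str>> {
    let mut groups: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for &path in paths {
        if let Some(key) = operator_key(path) {
            groups.entry(key).or_default().push(path);
        }
    }
    groups
}
```

   With this key, `hf://buckets/acme/t/a.parquet` and
   `hf://datasets/acme/t/b.parquet` land in separate groups even though both
   name `acme/t`.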
   
   ## CDC (content-defined chunking) table properties
   
   New table properties under the `parquet.cdc.*` namespace:
   
     - `parquet.cdc.min_chunk_size` (bytes)
     - `parquet.cdc.max_chunk_size` (bytes)
     - `parquet.cdc.norm_level`     (gearhash bit adjustment, i32)
   
   CDC is implicitly enabled if any `parquet.cdc.*` property is present;
   unset fields fall back to `parquet::file::properties::CdcOptions::default()`
   so the Iceberg layer stays in sync with parquet's own defaults. A new
   `iceberg::writer::create_writer_properties()` helper builds parquet
   `WriterProperties` from `TableProperties`, applying CDC options when
   configured. The DataFusion physical write plan uses this helper, so
   tables created through DataFusion automatically pick up CDC settings.
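
   The "implicitly enabled, defaults for unset fields" rule can be sketched
   without the parquet crate. `CdcSettings` and `cdc_from_properties` are
   hypothetical stand-ins for the real types, and the default values below
   are placeholders, not the values of parquet's `CdcOptions::default()`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for parquet's CDC options.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CdcSettings {
    min_chunk_size: usize,
    max_chunk_size: usize,
    norm_level: i32,
}

impl Default for CdcSettings {
    fn default() -> Self {
        // Placeholder defaults (assumption, not taken from the PR).
        Self { min_chunk_size: 256 * 1024, max_chunk_size: 1024 * 1024, norm_level: 0 }
    }
}

/// Returns `Ok(None)` when no `parquet.cdc.*` key is present (CDC disabled),
/// otherwise the defaults overridden by whichever keys are set.
fn cdc_from_properties(
    props: &HashMap<String, String>,
) -> Result<Option<CdcSettings>, String> {
    if !props.keys().any(|k| k.starts_with("parquet.cdc.")) {
        return Ok(None);
    }
    let mut settings = CdcSettings::default();
    if let Some(v) = props.get("parquet.cdc.min_chunk_size") {
        settings.min_chunk_size = v
            .parse::<usize>()
            .map_err(|e| format!("invalid parquet.cdc.min_chunk_size {v:?}: {e}"))?;
    }
    if let Some(v) = props.get("parquet.cdc.max_chunk_size") {
        settings.max_chunk_size = v
            .parse::<usize>()
            .map_err(|e| format!("invalid parquet.cdc.max_chunk_size {v:?}: {e}"))?;
    }
    if let Some(v) = props.get("parquet.cdc.norm_level") {
        settings.norm_level = v
            .parse::<i32>()
            .map_err(|e| format!("invalid parquet.cdc.norm_level {v:?}: {e}"))?;
    }
    Ok(Some(settings))
}
```

   Setting only `parquet.cdc.norm_level` is enough to turn CDC on in this
   sketch; both chunk sizes then keep their default values.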
   
   ## Other changes
   
   - `iceberg-storage-opendal`: migrated S3 credential plumbing from
     `reqsign 0.16` to `reqsign-aws-v4` / `reqsign-core` 3.0 (required
     by the opendal version that adds HF support). `CustomAwsCredentialLoader`
     now wraps any `ProvideCredential<Credential = AwsCredential>` rather
     than `Arc<dyn AwsCredentialLoad>`.
   - `OpenDalResolvingStorage`: replaced `opendal::Scheme` with a canonical
     `&'static str` cache key, removing the dependency on opendal's `Scheme`
     enum (which no longer exposes all needed variants in 0.56).
   - `OpenDalStorage::remove_prefix`: switched from `remove_all` to
     `delete_with(...).recursive(true)` for the new opendal API.
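
   A rough sketch of what a canonical `&'static str` cache key can look like
   is shown below. The alias table, `StorageCache`, and `get_or_create` are
   illustrative assumptions and do not reproduce `OpenDalResolvingStorage`.

```rust
use std::collections::HashMap;

/// Maps a URI scheme to a canonical key so that aliases share one entry.
/// The alias list is an assumption made for this example.
fn canonical_scheme(scheme: &str) -> Option<&'static str> {
    match scheme {
        "s3" | "s3a" => Some("s3"),
        "gs" | "gcs" => Some("gcs"),
        "hf" => Some("hf"),
        "memory" => Some("memory"),
        _ => None,
    }
}

/// Hypothetical cache that lazily builds one storage instance per scheme.
struct StorageCache<S> {
    by_scheme: HashMap<&'static str, S>,
}

impl<S> StorageCache<S> {
    fn new() -> Self {
        Self { by_scheme: HashMap::new() }
    }

    /// Returns the storage for the path's scheme, calling `build` only the
    /// first time that canonical scheme is seen. Unknown schemes give `None`.
    fn get_or_create(
        &mut self,
        path: &str,
        build: impl FnOnce(&'static str) -> S,
    ) -> Option<&S> {
        let (scheme, _) = path.split_once("://")?;
        let key = canonical_scheme(scheme)?;
        let storage = self.by_scheme.entry(key).or_insert_with(|| build(key));
        Some(&*storage)
    }
}
```

   Because the key is a plain string literal, the cache does not need any
   opendal enum to cover every backend it might resolve.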
   
   ## Tests
   
   - Rust unit tests for `HfUri` parsing (repo types, revisions including
     `refs/convert/parquet` and `refs/pr/N`, percent-encoded refs, edge
     cases) and CDC property parsing.
   - Rust integration tests in `crates/storage/opendal/tests/file_io_hf_test.rs`
     guarded on `HF_OPENDAL_TOKEN`, `HF_OPENDAL_BUCKET`, `HF_OPENDAL_DATASET`
     env vars; tests skip if any required env var is unset.
   - Python tests in `bindings/python/tests/test_hf_and_cdc.py` covering CDC
     property persistence, PyIceberg writes with CDC, DataFusion read-back,
     and HF credentials end-to-end (skipped without `HF_OPENDAL_TOKEN` and
     `HF_OPENDAL_TABLE_METADATA`).
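
   The skip-when-unset behaviour of the Rust integration tests can be
   approximated as below. `required_vars` and `hf_env_or_skip` are
   hypothetical helpers, not code from `file_io_hf_test.rs`; the lookup is
   injected so the logic can be exercised without touching the process
   environment.

```rust
/// Environment variables the Rust integration tests are guarded on.
const REQUIRED: [&str; 3] =
    ["HF_OPENDAL_TOKEN", "HF_OPENDAL_BUCKET", "HF_OPENDAL_DATASET"];

/// Returns every required variable with its value, or `None` as soon as
/// one of them is missing according to `lookup`.
fn required_vars(
    lookup: impl Fn(&str) -> Option<String>,
) -> Option<Vec<(&'static str, String)>> {
    REQUIRED
        .iter()
        .map(|&name| lookup(name).map(|value| (name, value)))
        .collect()
}

/// Reads the real environment; a test would return early on `None`.
fn hf_env_or_skip() -> Option<Vec<(&'static str, String)>> {
    let vars = required_vars(|name| std::env::var(name).ok());
    if vars.is_none() {
        eprintln!("skipping HF test: a required HF_OPENDAL_* variable is unset");
    }
    vars
}
```

   A test body would then start with something like
   `let Some(env) = hf_env_or_skip() else { return; };`, so the test passes
   trivially when credentials are not configured.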
   
   ## Dependencies
   
   `opendal` is pinned to a git revision of apache/opendal that includes
   the `services-hf` backend. Once a release containing HF support is
   published on crates.io, this should be flipped back to a version pin.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

