kszucs opened a new pull request, #2375:
URL: https://github.com/apache/iceberg-rust/pull/2375
Adds two opt-in capabilities aimed at storing Iceberg tables on
HuggingFace Hub with content-defined chunking for efficient deduplication.
## HuggingFace Hub storage backend
New `opendal-hf` feature on `iceberg-storage-opendal` (off by default,
included in `opendal-all`) that wires HuggingFace's OpenDAL service into
`FileIO`. Paths use the form:
`hf://[<repo_type>/]<owner>/<repo>[@<revision>]/<path_in_repo>`
where `repo_type` is one of `models` (default), `datasets`, `spaces`, or
`buckets` (XET-backed object storage). Configuration is passed via
`FileIOBuilder` properties:
- `hf.token` — API token (required for private repos / writes)
- `hf.endpoint` — Hub endpoint, defaults to https://huggingface.co
- `hf.revision` — fallback revision when a path has no `@<revision>`
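
To make the path grammar concrete, here is a minimal sketch of how such a path could be split into its parts. It is not the PR's `HfUri` parser: `HfPath`, `parse_hf_path` and `effective_revision` are hypothetical names, and the sketch assumes the revision is a single path segment, so it does not handle refs containing `/` or percent-encoded refs.

```rust
/// Repo types listed in the description; `models` is the default.
const REPO_TYPES: [&str; 4] = ["models", "datasets", "spaces", "buckets"];

/// Hypothetical, simplified view of a parsed `hf://` path.
#[derive(Debug, PartialEq)]
pub struct HfPath {
    pub repo_type: String,
    pub repo_id: String,          // "<owner>/<repo>"
    pub revision: Option<String>, // text after '@', if present
    pub path_in_repo: String,
}

/// Splits `hf://[<repo_type>/]<owner>/<repo>[@<revision>]/<path_in_repo>`.
/// Assumption: the revision contains no '/' and is not percent-decoded.
pub fn parse_hf_path(uri: &str) -> Option<HfPath> {
    let rest = uri.strip_prefix("hf://")?;
    let segments: Vec<&str> = rest.split('/').collect();

    // Consume the optional leading repo type, defaulting to `models`.
    let (repo_type, tail) = match segments.first() {
        Some(first) if REPO_TYPES.contains(first) => (*first, &segments[1..]),
        _ => ("models", &segments[..]),
    };

    // Need an owner, a repo (optionally with "@revision") and a path.
    if tail.len() < 3 {
        return None;
    }
    let owner = tail[0];
    let (repo, revision) = match tail[1].split_once('@') {
        Some((repo, rev)) => (repo, Some(rev.to_string())),
        None => (tail[1], None),
    };
    if owner.is_empty() || repo.is_empty() {
        return None;
    }

    Some(HfPath {
        repo_type: repo_type.to_string(),
        repo_id: format!("{owner}/{repo}"),
        revision,
        path_in_repo: tail[2..].join("/"),
    })
}

/// Applies an `hf.revision`-style fallback when the path has no `@<revision>`.
pub fn effective_revision<'a>(path: &'a HfPath, fallback: Option<&'a str>) -> Option<&'a str> {
    path.revision.as_deref().or(fallback)
}
```

In this sketch a revision embedded in the path always wins over the fallback, which matches the description of `hf.revision` as a fallback only.
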
The `OpenDalResolvingStorage` recognises the `hf` scheme and lazily
constructs a per-scheme storage instance. `delete_stream` groups paths
by `<repo_type>/<repo_id>`, so a bucket path and a dataset path that share
a repo ID never share an operator.
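
The grouping step can be illustrated with a small sketch. `operator_key` and `group_by_operator` are hypothetical helpers, not functions from the PR. The sketch follows the stated key literally, so it drops any `@<revision>` suffix; whether the actual implementation also ignores the revision is an assumption.

```rust
use std::collections::HashMap;

const REPO_TYPES: [&str; 4] = ["models", "datasets", "spaces", "buckets"];

/// Returns "<repo_type>/<owner>/<repo>" for an `hf://` path, or `None`
/// if the path does not have the expected shape.
fn operator_key(uri: &str) -> Option<String> {
    let rest = uri.strip_prefix("hf://")?;
    let mut parts = rest.split('/');
    let first = parts.next()?;
    let (repo_type, owner) = if REPO_TYPES.contains(&first) {
        (first, parts.next()?)
    } else {
        ("models", first)
    };
    // Drop any "@<revision>" suffix (assumption: revisions share a key).
    let repo = parts.next()?.split('@').next()?;
    if owner.is_empty() || repo.is_empty() {
        return None;
    }
    Some(format!("{repo_type}/{owner}/{repo}"))
}

/// Buckets paths by operator key so each group can be deleted through
/// its own operator. Paths that fail to parse are skipped in this sketch;
/// real code would more likely report an error.
pub fn group_by_operator(paths: &[&str]) -> HashMap<String, Vec<String>> {
    let mut groups: HashMap<String, Vec<String>> = HashMap::new();
    for path in paths {
        if let Some(key) = operator_key(path) {
            groups.entry(key).or_default().push(path.to_string());
        }
    }
    groups
}
```

Because the repo type is part of the key, `hf://buckets/acme/events/...` and `hf://datasets/acme/events/...` land in different groups even though the repo ID is identical.
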
## CDC (content-defined chunking) table properties
New table properties under the `parquet.cdc.*` namespace:
- `parquet.cdc.min_chunk_size` (bytes)
- `parquet.cdc.max_chunk_size` (bytes)
- `parquet.cdc.norm_level` (gearhash bit adjustment, i32)
CDC is implicitly enabled if any `parquet.cdc.*` property is present;
unset fields fall back to `parquet::file::properties::CdcOptions::default()`
so the Iceberg layer stays in sync with parquet's own defaults. A new
`iceberg::writer::create_writer_properties()` helper builds parquet
`WriterProperties` from `TableProperties`, applying CDC options when
configured. The DataFusion physical write plan uses this helper, so
tables created through DataFusion automatically pick up CDC settings.
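
The "implicitly enabled, with defaults for unset fields" rule can be sketched as follows. This is an illustration only: `CdcSettings` and `parse_cdc` are hypothetical stand-ins for the parquet `CdcOptions` type and the PR's parsing code, and the default values below are placeholders chosen for the sketch, not values taken from the PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for parquet's `CdcOptions`.
#[derive(Debug, Clone, PartialEq)]
pub struct CdcSettings {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub norm_level: i32,
}

impl Default for CdcSettings {
    fn default() -> Self {
        // Placeholder defaults, assumed for this sketch only.
        Self {
            min_chunk_size: 256 * 1024,
            max_chunk_size: 1024 * 1024,
            norm_level: 0,
        }
    }
}

/// Returns `Ok(None)` when no `parquet.cdc.*` property is present,
/// otherwise the defaults overridden by whichever properties are set.
pub fn parse_cdc(props: &HashMap<String, String>) -> Result<Option<CdcSettings>, String> {
    const PREFIX: &str = "parquet.cdc.";
    if !props.keys().any(|k| k.starts_with(PREFIX)) {
        return Ok(None); // CDC stays disabled
    }

    let mut cdc = CdcSettings::default();
    if let Some(v) = props.get("parquet.cdc.min_chunk_size") {
        cdc.min_chunk_size = v
            .parse::<usize>()
            .map_err(|e| format!("parquet.cdc.min_chunk_size: {e}"))?;
    }
    if let Some(v) = props.get("parquet.cdc.max_chunk_size") {
        cdc.max_chunk_size = v
            .parse::<usize>()
            .map_err(|e| format!("parquet.cdc.max_chunk_size: {e}"))?;
    }
    if let Some(v) = props.get("parquet.cdc.norm_level") {
        cdc.norm_level = v
            .parse::<i32>()
            .map_err(|e| format!("parquet.cdc.norm_level: {e}"))?;
    }
    Ok(Some(cdc))
}
```

Starting from `Default` and overriding field by field is what keeps unset fields aligned with the upstream defaults: the Iceberg layer never hard-codes its own copy of them.
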
## Other changes
- `iceberg-storage-opendal`: migrated S3 credential plumbing from
`reqsign 0.16` to `reqsign-aws-v4` / `reqsign-core` 3.0 (required
by the opendal version that adds HF support). `CustomAwsCredentialLoader`
now wraps any `ProvideCredential<Credential = AwsCredential>` rather
than `Arc<dyn AwsCredentialLoad>`.
- `OpenDalResolvingStorage`: replaced `opendal::Scheme` with a canonical
`&'static str` cache key, removing the dependency on opendal's `Scheme`
enum (which no longer exposes all needed variants in 0.56).
- `OpenDalStorage::remove_prefix`: switched from `remove_all` to
`delete_with(...).recursive(true)` for the new opendal API.
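
As a rough illustration of the string-keyed cache idea, the sketch below maps a URL scheme to a canonical `&'static str` and builds one storage per key on first use. `canonical_scheme` and `StorageCache` are hypothetical names, the alias table is an assumption, and the sketch is single-threaded, unlike a real resolving storage.

```rust
use std::collections::HashMap;

/// Maps a URL scheme to a canonical cache key.
/// The alias set here is assumed for illustration.
pub fn canonical_scheme(scheme: &str) -> Option<&'static str> {
    match scheme {
        "s3" | "s3a" => Some("s3"),
        "hf" => Some("hf"),
        "memory" => Some("memory"),
        _ => None,
    }
}

/// Lazily built storages, keyed by canonical scheme.
pub struct StorageCache<S> {
    storages: HashMap<&'static str, S>,
}

impl<S> StorageCache<S> {
    pub fn new() -> Self {
        Self { storages: HashMap::new() }
    }

    /// Returns the storage for the scheme of `path`, calling `build`
    /// only the first time that canonical scheme is seen.
    pub fn get_or_build(
        &mut self,
        path: &str,
        build: impl FnOnce(&'static str) -> S,
    ) -> Option<&S> {
        let (scheme, _) = path.split_once("://")?;
        let key = canonical_scheme(scheme)?;
        let storage: &S = self.storages.entry(key).or_insert_with(|| build(key));
        Some(storage)
    }
}
```

Keying on a plain string means the cache does not depend on which variants an upstream enum happens to expose, at the cost of maintaining the alias table by hand.
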
## Tests
- Rust unit tests for `HfUri` parsing (repo types, revisions including
`refs/convert/parquet` and `refs/pr/N`, percent-encoded refs, edge
cases) and CDC property parsing.
- Rust integration tests in `crates/storage/opendal/tests/file_io_hf_test.rs`
guarded on `HF_OPENDAL_TOKEN`, `HF_OPENDAL_BUCKET`, `HF_OPENDAL_DATASET`
env vars; tests skip if any required env var is unset.
- Python tests in `bindings/python/tests/test_hf_and_cdc.py` covering CDC
property persistence, PyIceberg writes with CDC, DataFusion read-back,
and HF credentials end-to-end (skipped without `HF_OPENDAL_TOKEN` and
`HF_OPENDAL_TABLE_METADATA`).
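
One common way to express the "skip if any required env var is unset" guard is a small helper like the one below. `required_env` is a hypothetical name and not necessarily how the PR's tests are written; treating an empty value as unset is also an assumption.

```rust
use std::env;

/// Returns the values of all `names` in order, or `None` if any of them
/// is unset (or empty, which this sketch treats as unset).
pub fn required_env(names: &[&str]) -> Option<Vec<String>> {
    names
        .iter()
        .map(|name| env::var(name).ok().filter(|v| !v.is_empty()))
        .collect()
}

// Usage inside a test (illustrative):
//
// #[test]
// fn writes_to_bucket() {
//     let Some(vars) = required_env(&["HF_OPENDAL_TOKEN", "HF_OPENDAL_BUCKET"]) else {
//         eprintln!("skipping: required HF_OPENDAL_* env vars are not set");
//         return;
//     };
//     let (token, bucket) = (&vars[0], &vars[1]);
//     // ... build FileIO with `hf.token` and exercise the bucket ...
// }
```

Returning early keeps the test green on machines without credentials, so the skip message on stderr is the only sign that nothing was exercised.
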
## Dependencies
`opendal` is pinned to a git revision of apache/opendal that includes
the `services-hf` backend. Once a release containing HF support is
published on crates.io, this should be flipped back to a version pin.