emilk opened a new pull request, #10075:
URL: https://github.com/apache/arrow-rs/pull/10075
# Which issue does this PR close?
- Closes #10069.
# Rationale for this change
`Field` and `Schema` store their metadata as `HashMap<String, String>`,
which makes cloning them expensive (a deep clone with an allocation per entry).
This is in contrast to basically everything else in `arrow-rs`, which uses
`Arc` for cheap cloning. `HashMap` also has a non-deterministic iteration
order, forcing every consumer that needs determinism (IPC encoding, `Display`,
`Hash`, `Ord`) to collect and sort the keys first.
# What changes are included in this PR?
Adds a new `Metadata` type to `arrow-schema`:
```rust
#[derive(Clone, Default, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Metadata(
// `None` means empty, so an empty `Metadata` never allocates
Option<Arc<BTreeMap<String, String>>>,
);
```
* Cloning is always cheap (bumps a refcount).
* Mutation (`insert`/`remove`/`extend`) is copy-on-write via
`Arc::make_mut`: the underlying map is deep-cloned only if it is shared.
* Iteration order is deterministic (sorted by key) thanks to `BTreeMap`.
* The inner map is never empty (`None` encodes the empty map), keeping the
derived `PartialEq`/`Ord`/`Hash` implementations consistent.
* Familiar map API: `get`, `insert`, `remove`, `contains_key`, `len`,
`iter`, `keys`, `values`, `clear`, `Index<&str>`, plus `From`/`Into`
conversions for `HashMap`, `BTreeMap`, and `[(K, V); N]` arrays, and
`FromIterator`/`Extend`/`IntoIterator` implementations.
* `serde` serialization format is identical to the `HashMap` it replaces
(verified by a test asserting byte-for-byte equality with postcard).
`Field::metadata`, `Schema::metadata`, and `SchemaBuilder::metadata` now use
`Metadata`. The setters (`with_metadata`, `set_metadata`,
`Schema::new_with_metadata`) take `impl Into<Metadata>`, so existing callers
passing a `HashMap` keep compiling, and callers can now also write
`field.with_metadata([("key", "value")])` directly.
Now-redundant sorting has been removed, since `Metadata` already iterates in
sorted key order:
* `Field::cmp`, `Field::hash`, and `Schema::hash` (which now is derived) no
longer collect and sort keys.
* `arrow_ipc::convert::metadata_to_fb` no longer sorts keys (IPC, Flight,
and Parquet schema encoding funnel through this).
* The `DataType` `Display` impl no longer sorts metadata entries.
# Are these changes tested?
Yes:
* New unit tests for `Metadata` covering the empty/allocation-free
invariant, insert/get/remove, copy-on-write semantics, deterministic iteration
order, equality with std maps, `Extend`, `Debug`, and serde round-trips
(including format compatibility with the maps it replaces).
* The existing test suites cover the integration (`cargo test --workspace
--all-features` passes, apart from the pre-existing
`test_shrink_to_fit_after_concat` failure that also fails on `main` with
`--all-features`).
# Are there any user-facing changes?
Yes, this is a **breaking change** to `arrow-schema`:
* `Field::metadata()`, `Field::metadata_mut()`, `Schema::metadata()`,
`SchemaBuilder::metadata()`, and `SchemaBuilder::metadata_mut()` now return
`&Metadata`/`&mut Metadata` instead of `&HashMap<String, String>`/`&mut
HashMap<String, String>`. The public field `Schema::metadata` is now a
`Metadata`.
* `Field::with_metadata`, `Field::set_metadata`, `Schema::with_metadata`,
and `Schema::new_with_metadata` now take `impl Into<Metadata>` instead of
`HashMap<String, String>`. Callers passing a `HashMap` continue to compile, but
callers relying on type inference (e.g. `.with_metadata(iter.collect())` or
`.with_metadata(Default::default())`) need an explicit type annotation.
* `ExtensionType::try_new_from_field_metadata` and
`arrow_ipc::convert::metadata_to_fb` now take `&Metadata` instead of
`&HashMap<String, String>`.
* `RecordBatch::schema_metadata_mut` now returns `&mut Metadata`.
* Code that needs an actual `HashMap`/`BTreeMap` can convert with `.into()`
or iterate via `.iter()`.
Since `Metadata` mirrors the map API, most code (e.g.
`schema.metadata.get("key")`, `field.metadata().is_empty()`) compiles unchanged.
Additionally, metadata iteration order (e.g. in `Debug` output) is now
deterministic (sorted by key), where it previously followed `HashMap`'s
arbitrary order.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]