emilk opened a new pull request, #10075:
URL: https://github.com/apache/arrow-rs/pull/10075

   # Which issue does this PR close?
   
   - Closes #10069.
   
   # Rationale for this change
   
   `Field` and `Schema` store their metadata as `HashMap<String, String>`, 
which makes cloning them expensive (a deep clone with an allocation per entry). 
This is in contrast to basically everything else in `arrow-rs`, which uses 
`Arc` for cheap cloning. `HashMap` also has a non-deterministic iteration 
order, forcing every consumer that needs determinism (IPC encoding, `Display`, 
`Hash`, `Ord`) to collect and sort the keys first.
   
   # What changes are included in this PR?
   
   Adds a new `Metadata` type to `arrow-schema`:
   
   ```rust
   #[derive(Clone, Default, PartialEq, Eq, PartialOrd, Ord, Hash)]
   pub struct Metadata(
       // `None` means empty, so an empty `Metadata` never allocates
       Option<Arc<BTreeMap<String, String>>>,
   );
   ```
   
   * Cloning is always cheap (bumps a refcount).
   * Mutation (`insert`/`remove`/`extend`) is copy-on-write via 
`Arc::make_mut`: the underlying map is deep-cloned only if it is shared.
   * Iteration order is deterministic (sorted by key) thanks to `BTreeMap`.
   * The inner map is never empty (`None` encodes the empty map), keeping the 
derived `PartialEq`/`Ord`/`Hash` implementations consistent.
   * Familiar map API: `get`, `insert`, `remove`, `contains_key`, `len`, 
`iter`, `keys`, `values`, `clear`, `Index<&str>`, plus `From`/`Into` 
conversions for `HashMap`, `BTreeMap`, and `[(K, V); N]` arrays, and 
`FromIterator`/`Extend`/`IntoIterator` implementations.
   * `serde` serialization format is identical to the `HashMap` it replaces 
(verified by a test asserting byte-for-byte equality with postcard).
   
   `Field::metadata`, `Schema::metadata`, and `SchemaBuilder::metadata` now use 
`Metadata`. The setters (`with_metadata`, `set_metadata`, 
`Schema::new_with_metadata`) take `impl Into<Metadata>`, so existing callers 
passing a `HashMap` keep compiling, and callers can now also write 
`field.with_metadata([("key", "value")])` directly.
   
   Now-redundant sorting has been removed, since `Metadata` already iterates in 
sorted key order:
   
   * `Field::cmp`, `Field::hash`, and `Schema::hash` (which now is derived) no 
longer collect and sort keys.
   * `arrow_ipc::convert::metadata_to_fb` no longer sorts keys (IPC, Flight, 
and Parquet schema encoding funnel through this).
   * The `DataType` `Display` impl no longer sorts metadata entries.
   
   # Are these changes tested?
   
   Yes:
   
   * New unit tests for `Metadata` covering the empty/allocation-free 
invariant, insert/get/remove, copy-on-write semantics, deterministic iteration 
order, equality with std maps, `Extend`, `Debug`, and serde round-trips 
(including format compatibility with the maps it replaces).
   * The existing test suites cover the integration (`cargo test --workspace 
--all-features` passes, apart from the pre-existing 
`test_shrink_to_fit_after_concat` failure that also fails on `main` with 
`--all-features`).
   
   # Are there any user-facing changes?
   
   Yes, this is a **breaking change** to `arrow-schema`:
   
   * `Field::metadata()`, `Field::metadata_mut()`, `Schema::metadata()`, 
`SchemaBuilder::metadata()`, and `SchemaBuilder::metadata_mut()` now return 
`&Metadata`/`&mut Metadata` instead of `&HashMap<String, String>`/`&mut 
HashMap<String, String>`. The public field `Schema::metadata` is now a 
`Metadata`.
   * `Field::with_metadata`, `Field::set_metadata`, `Schema::with_metadata`, 
and `Schema::new_with_metadata` now take `impl Into<Metadata>` instead of 
`HashMap<String, String>`. Callers passing a `HashMap` continue to compile, but 
callers relying on type inference (e.g. `.with_metadata(iter.collect())` or 
`.with_metadata(Default::default())`) need an explicit type annotation.
   * `ExtensionType::try_new_from_field_metadata` and 
`arrow_ipc::convert::metadata_to_fb` now take `&Metadata` instead of 
`&HashMap<String, String>`.
   * `RecordBatch::schema_metadata_mut` now returns `&mut Metadata`.
   * Code that needs an actual `HashMap`/`BTreeMap` can convert with `.into()` 
or iterate via `.iter()`.
   
   Since `Metadata` mirrors the map API, most code (e.g. 
`schema.metadata.get("key")`, `field.metadata().is_empty()`) compiles unchanged.
   
   Additionally, metadata iteration order (e.g. in `Debug` output) is now 
deterministic (sorted by key), where it previously followed `HashMap`'s 
arbitrary order.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to