jecsand838 opened a new pull request, #8348:
URL: https://github.com/apache/arrow-rs/pull/8348
# Which issue does this PR close?
This work continues arrow-avro schema resolution support and aligns behavior
with the Avro spec.
- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out the
reader/decoder, including schema resolution and type promotion.
# Rationale for this change
`arrow-avro` lacked end-to-end support for Avro unions and Arrow `Union`
schemas. Many Avro datasets rely on unions (e.g., `["null","string"]` or tagged
unions of different records), and without schema-level resolution and JSON
encoding the crate could not interoperate cleanly. This PR brings union schema
resolution to parity with the Avro spec (duplicate-branch and nested-union
checks), adds Arrow-to-Avro union schema conversion (with mode/type-id
metadata), and lays groundwork for data decoding in a follow-up.
# What changes are included in this PR?
**Schema resolution & codecs**
- Add `Codec::Union(Arc<[AvroDataType]>, UnionFields, UnionMode)` and map it
to Arrow `DataType::Union`.
- Introduce `ResolvedUnion` and extend `ResolutionInfo` with a `Union(...)`
variant to capture the writer-to-reader branch mapping (direct matches are
preferred over promotions).
- Support union defaults: permit `null` defaults for unions whose **first**
branch is `null`; reject empty unions for defaults.
- Enforce Avro spec constraints during parsing/resolution:
  - Disallow nested unions.
  - Disallow duplicate branch *kinds* (except distinct named
    `record`/`enum`/`fixed`).
- Keep **writer** null ordering when resolving nullable 2-branch unions
(e.g., `["null", "int"]` vs `["int", "null"]`).
- Provide stable union field names derived from branch kind (e.g., `int`,
`string`, `map`, ...) and construct dense `UnionFields` consistently.
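
To illustrate the "direct matches over promotions" rule, here is a minimal, std-only sketch. It is not the crate's implementation: `Kind`, `Match`, `promotes_to`, and `resolve_union` are hypothetical names, and only a handful of primitive kinds are modeled. The promotion table follows the Avro spec (int to long/float/double, long to float/double, float to double, string and bytes to each other).

```rust
/// Simplified stand-in for an Avro branch kind (hypothetical, not arrow-avro's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Null,
    Int,
    Long,
    Float,
    Double,
    Str,
    Bytes,
}

/// How one writer branch maps onto a reader branch index.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Match {
    Direct(usize),
    Promoted(usize),
}

/// Promotions permitted by the Avro spec for the kinds modeled here.
fn promotes_to(writer: Kind, reader: Kind) -> bool {
    matches!(
        (writer, reader),
        (Kind::Int, Kind::Long)
            | (Kind::Int, Kind::Float)
            | (Kind::Int, Kind::Double)
            | (Kind::Long, Kind::Float)
            | (Kind::Long, Kind::Double)
            | (Kind::Float, Kind::Double)
            | (Kind::Str, Kind::Bytes)
            | (Kind::Bytes, Kind::Str)
    )
}

/// For each writer branch, choose a reader branch: an exact kind match wins,
/// otherwise the first promotable reader branch, otherwise `None`
/// (the writer branch is not covered by the reader union).
pub fn resolve_union(writer: &[Kind], reader: &[Kind]) -> Vec<Option<Match>> {
    writer
        .iter()
        .map(|&w| {
            reader
                .iter()
                .position(|&r| r == w)
                .map(Match::Direct)
                .or_else(|| {
                    reader
                        .iter()
                        .position(|&r| promotes_to(w, r))
                        .map(Match::Promoted)
                })
        })
        .collect()
}
```

With writer `[Null, Int, Str]` and reader `[Long, Int, Null]`, the `Int` branch resolves directly to reader index 1 even though `Long` at index 0 would accept a promotion, and `Str` has no match at all (partial coverage).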
**Arrow and Avro schema conversion**
- Implement Arrow `DataType::Union` to Avro union JSON:
  - Persist Arrow union layout via metadata keys:
    - `"arrowUnionMode"`: `"dense"` or `"sparse"`.
    - `"arrowUnionTypeIds"`: ordered list of Arrow type IDs.
  - Attach union-level metadata to the **first non-null** branch object
    (Avro JSON can’t carry attributes on the union array).
- Persist additional Arrow metadata in Avro JSON:
  - `"arrowBinaryView"` for `BinaryView`.
  - `"arrowListView"` / `"arrowLargeList"` for list view types.
- Reject invalid output shapes (e.g., a union branch that is itself an Avro
union).
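
As a rough picture of where the union metadata ends up, the sketch below builds an Avro union JSON array by hand and attaches the two keys to the first non-null branch. The function `union_to_avro_json` is hypothetical, and the exact encoding of `"arrowUnionTypeIds"` (a JSON array here) and the compact key order are assumptions rather than the crate's actual output. Only primitive branch names are handled.

```rust
/// Render a union of primitive branches as Avro JSON, attaching the Arrow
/// union metadata to the first non-null branch (hypothetical helper).
pub fn union_to_avro_json(branches: &[&str], mode: &str, type_ids: &[i8]) -> String {
    // Assumption: type IDs are serialized as a JSON array of integers.
    let ids = type_ids
        .iter()
        .map(|id| id.to_string())
        .collect::<Vec<_>>()
        .join(",");

    let mut attached = false;
    let parts: Vec<String> = branches
        .iter()
        .map(|&b| {
            if b != "null" && !attached {
                // The union array itself cannot carry attributes, so the
                // first non-null branch becomes an object holding them.
                attached = true;
                format!(
                    r#"{{"type":"{}","arrowUnionMode":"{}","arrowUnionTypeIds":[{}]}}"#,
                    b, mode, ids
                )
            } else {
                format!(r#""{}""#, b)
            }
        })
        .collect();

    format!("[{}]", parts.join(","))
}
```

Calling `union_to_avro_json(&["null", "int", "string"], "dense", &[0, 1, 2])` leaves `"null"` and `"string"` as plain type names and turns only the `int` branch into an object carrying both metadata keys.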
**Reader/decoder stub**
- Return a clear error for union **value** decoding in `RecordDecoder`
(schema support first; decoding to follow).
**Refactors & utilities**
- Expose `make_full_name` within the crate for union branch keying; derive
`Hash` for `PrimitiveType`; add helpers for branch de‑duplication.
# Are these changes tested?
Yes. New unit tests cover:
- Resolution across writer/reader unions and non‑unions (direct vs promoted
matches, partial coverage).
- Nullable‑union semantics (writer null ordering preserved).
- Arrow `Union` to Avro union JSON including mode/type‑id metadata and
branch shapes.
- Validation errors for duplicates and nested unions.
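
The validation rules exercised by these tests (no nested unions, no duplicate branch kinds except distinctly named `record`/`enum`/`fixed`) can be sketched as follows. This is an illustrative std-only model; `Branch` and `validate_union` are hypothetical names and the error strings are invented for the example.

```rust
use std::collections::HashSet;

/// Simplified model of a union branch (hypothetical, not arrow-avro's type).
#[derive(Debug, Clone)]
pub enum Branch {
    /// Unnamed kinds such as "null", "int", "string", "array", "map".
    Unnamed(&'static str),
    /// Named kinds ("record", "enum", "fixed") keyed by their full name.
    Named {
        kind: &'static str,
        full_name: &'static str,
    },
    /// A union used directly as a branch, which the Avro spec forbids.
    Union(Vec<Branch>),
}

/// Reject nested unions and duplicate branches. Unnamed kinds may appear
/// once; named kinds may repeat as long as their full names differ.
pub fn validate_union(branches: &[Branch]) -> Result<(), String> {
    let mut seen: HashSet<(&str, &str)> = HashSet::new();
    for branch in branches {
        let key = match branch {
            Branch::Union(_) => {
                return Err("nested unions are not allowed".to_string());
            }
            Branch::Unnamed(kind) => (*kind, ""),
            Branch::Named { kind, full_name } => (*kind, *full_name),
        };
        // `insert` returns false when the key was already present.
        if !seen.insert(key) {
            return Err(format!("duplicate union branch: {}", key.0));
        }
    }
    Ok(())
}
```

Under this model `["int", "int"]` is rejected, while two records named `a.A` and `a.B` are accepted because the key includes the full name.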
# Are there any user-facing changes?
N/A