jecsand838 opened a new issue, #9233:
URL: https://github.com/apache/arrow-rs/issues/9233
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
`arrow-avro` currently contains Arrow -> Avro schema logic in `schema.rs`
that was originally built as a *writer convenience* for when an Avro schema is
not provided. As such, it was developed to make a best-effort attempt at
synthesizing an `AvroSchema` from an Arrow `Schema` (plus some Arrow schema
metadata like `avro.name` / `avro.namespace`).
As `arrow-avro` adoption grows (OCF files, SOE frames, Confluent/Apicurio
framing), we increasingly need schema behavior that is:
- **Explicit** about whether we are using a real Avro schema vs inferring
one,
- **Modular** so it’s more maintainable (today `schema.rs` is a large
multi-purpose module),
- **Correct-by-construction** so downstream consumers don’t need to patch up
inferred schemas or reimplement Avro schema editing.
Real-world pain points motivating this include:
- Schema inference from Arrow metadata alone can produce incorrect Avro
schemas for nested named types (e.g. #8928: confusion between nested record
*type name* and *field name*).
- Downstream consumers (e.g. DataFusion) want to apply column projection at
the Avro schema level without reimplementing Avro-aware projection and metadata
handling (see #8923).
- Tests/integration code often treat `{ Arrow schema +
avro.name/avro.namespace }` as sufficient, but this is not reliable for all
schemas, and brittle inference can break reader schema workflows.
- The Arrow -> Avro path needs clearer configuration points (null-union
ordering, naming strategy for generated nested types, metadata passthrough
policy, etc.), but these knobs are currently either implicit, crate-private, or
spread across helpers.
**Describe the solution you'd like**
Refactor / enhance `schema.rs` so Arrow -> Avro schema behavior is
**explicit, modular, and correct-by-construction**, with APIs that clearly
distinguish between these three fundamental schema conversion functions:
1. **"Using the real schema"** (preferred for readers): consume the Avro
writer schema (OCF header / schema registry / user-provided) and optionally
transform it into a reader schema as needed (projection, evolution).
2. **"Inferring a schema (defaults)"** (writer convenience): synthesize Avro
JSON from an Arrow `Schema` when no Avro schema JSON is provided / embedded.
3. **"Building an explicitly correct `AvroSchema` from an Arrow `Schema`
(configured builder for users)"**: add an `ArrowToAvroSchemaBuilder` for
constructing an `AvroSchema` from an Arrow `Schema` with explicit configuration
knobs.
Below are additional details for the proposed solution:
**A) Introduce `ArrowToAvroSchemaBuilder`**
Add a public builder style API along these lines:
```rust
use arrow_schema::Schema as ArrowSchema;
use arrow_avro::schema::AvroSchema;
// minimal defaults (equivalent to today's best effort inference)
let avro: AvroSchema = ArrowToAvroSchemaBuilder::new(&arrow_schema).build()?;
// explicit configuration
let avro: AvroSchema = ArrowToAvroSchemaBuilder::new(&arrow_schema)
.with_root_name("User")
.with_namespace("com.example")
.with_doc("Schema inferred from Arrow")
.with_nullability_order(Nullability::NullFirst)
.with_strip_internal_arrow_metadata(true)
.with_type_naming_strategy(TypeNamingStrategy::PathBased)
.with_passthrough_metadata_policy(PassthroughMetadataPolicy::Default)
.build()?;
```
Initial builder "with_" knobs that would help correctness and downstream
use-cases:
* Root record identity:
* `with_root_name(...)` (default: `AVRO_ROOT_RECORD_DEFAULT_NAME` or Arrow
`avro.name`)
* `with_namespace(...)` (default: Arrow `avro.namespace` if present)
* `with_doc(...)` (default: Arrow `avro.doc` if present)
* Nullability + unions:
* `with_nullability_order(Nullability::NullFirst|NullSecond)` (default
`NullFirst`, aligning with Avro union-default constraints)
* Metadata behavior:
* `with_strip_internal_arrow_metadata(bool)` (defaults to current behavior)
* `with_passthrough_metadata_policy(...)` controlling how non-reserved
Arrow metadata becomes Avro attributes (today there is logic for "passthrough
metadata" that could become configurable)
* Naming strategy for generated nested named types (records/enums/fixed):
* `with_type_naming_strategy(...)` to guarantee deterministic and
collision-free nested type names
* (optional) `with_type_name_overrides(...)` for explicit mapping by Arrow
field-path
* Logical/extension type policy:
* Define how Arrow logical/extension types map to Avro logical types, and
what happens when unsupported (error vs fallback encoding)
This builder should be positioned as the explicit advanced inference entry
point, while keeping a simpler defaults path for writer convenience.
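To make the shape of these knobs concrete, here is a minimal, self-contained sketch. It is not the proposed `arrow-avro` implementation: `Field`, `SchemaBuilderSketch`, and `DEFAULT_ROOT_NAME_PLACEHOLDER` are hypothetical stand-ins, the output is a plain JSON string rather than an `AvroSchema`, and only the root name, namespace, and null-ordering knobs are modeled. It does show one behavior the real builder would need: a `null` default is only emitted when `"null"` is the first union branch.

```rust
/// Where the `null` branch sits in an optional field's union.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Nullability {
    NullFirst,
    NullSecond,
}

/// Hypothetical stand-in for an Arrow field (name, Avro primitive type, nullable flag).
pub struct Field {
    pub name: &'static str,
    pub avro_type: &'static str,
    pub nullable: bool,
}

/// Placeholder only; the real value of `AVRO_ROOT_RECORD_DEFAULT_NAME` is not assumed here.
pub const DEFAULT_ROOT_NAME_PLACEHOLDER: &str = "Root";

/// Hypothetical builder that mirrors a few of the proposed `with_*` knobs.
pub struct SchemaBuilderSketch<'a> {
    fields: &'a [Field],
    root_name: String,
    namespace: Option<String>,
    order: Nullability,
}

impl<'a> SchemaBuilderSketch<'a> {
    pub fn new(fields: &'a [Field]) -> Self {
        Self {
            fields,
            root_name: DEFAULT_ROOT_NAME_PLACEHOLDER.to_string(),
            namespace: None,
            order: Nullability::NullFirst,
        }
    }

    pub fn with_root_name(mut self, name: &str) -> Self {
        self.root_name = name.to_string();
        self
    }

    pub fn with_namespace(mut self, namespace: &str) -> Self {
        self.namespace = Some(namespace.to_string());
        self
    }

    pub fn with_nullability_order(mut self, order: Nullability) -> Self {
        self.order = order;
        self
    }

    /// Emits Avro record JSON. No JSON escaping is done, so names are assumed
    /// to be plain identifiers.
    pub fn build(self) -> Result<String, String> {
        if self.root_name.is_empty() {
            return Err("root record name must not be empty".to_string());
        }
        let fields: Vec<String> = self
            .fields
            .iter()
            .map(|f| {
                if !f.nullable {
                    return format!(r#"{{"name":"{}","type":"{}"}}"#, f.name, f.avro_type);
                }
                match self.order {
                    // A null default is valid only when "null" is the first branch.
                    Nullability::NullFirst => format!(
                        r#"{{"name":"{}","type":["null","{}"],"default":null}}"#,
                        f.name, f.avro_type
                    ),
                    // With null second, the sketch emits no default at all.
                    Nullability::NullSecond => format!(
                        r#"{{"name":"{}","type":["{}","null"]}}"#,
                        f.name, f.avro_type
                    ),
                }
            })
            .collect();
        let namespace = self
            .namespace
            .map(|n| format!(r#","namespace":"{}""#, n))
            .unwrap_or_default();
        Ok(format!(
            r#"{{"type":"record","name":"{}"{},"fields":[{}]}}"#,
            self.root_name,
            namespace,
            fields.join(",")
        ))
    }
}
```

Calling `SchemaBuilderSketch::new(&fields).build()` with no knobs set plays the role of the "minimal defaults" path, while chaining `with_*` calls plays the role of the explicit path.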
**B) Make "use embedded schema" vs "infer schema" explicit**
Today, the Arrow schema metadata key `SCHEMA_METADATA_KEY = "avro.schema"`
can contain the full Avro schema JSON. When present, it is often preferable to
use it verbatim to preserve exact schema identity across OCF/SOE/registry
contexts.
We should make this explicit and stable:
* A clear helper for "use embedded Avro schema if present, else error"
(reader-like behavior)
* A clear helper for "use embedded schema if present, else infer" (writer
convenience)
(Exact API design TBD, but could be builder flags or separate helpers.)
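As a rough illustration of the two helpers, the sketch below works on a plain `HashMap<String, String>` standing in for Arrow schema metadata. The function names `embedded_schema_or_error` and `embedded_schema_or_infer`, the `SchemaSource` enum, and the closure-based `infer` parameter are all hypothetical; only the `avro.schema` key comes from the current code.

```rust
use std::collections::HashMap;

/// Metadata key that may carry the full Avro schema JSON.
pub const SCHEMA_METADATA_KEY: &str = "avro.schema";

/// Records which path produced the schema, so callers can tell the two apart.
#[derive(Debug, PartialEq)]
pub enum SchemaSource {
    Embedded(String),
    Inferred(String),
}

/// Reader-like behavior: use the embedded schema verbatim, or fail.
pub fn embedded_schema_or_error(metadata: &HashMap<String, String>) -> Result<String, String> {
    metadata
        .get(SCHEMA_METADATA_KEY)
        .cloned()
        .ok_or_else(|| format!("Arrow schema metadata has no `{}` entry", SCHEMA_METADATA_KEY))
}

/// Writer convenience: use the embedded schema if present, otherwise call
/// `infer` (a stand-in for the Arrow -> Avro inference path).
pub fn embedded_schema_or_infer<F>(metadata: &HashMap<String, String>, infer: F) -> SchemaSource
where
    F: FnOnce() -> String,
{
    match metadata.get(SCHEMA_METADATA_KEY) {
        Some(json) => SchemaSource::Embedded(json.clone()),
        None => SchemaSource::Inferred(infer()),
    }
}
```

Returning a tagged `SchemaSource` rather than a bare string is one way to keep "real schema" and "inferred schema" distinguishable downstream; builder flags would be an equally valid design.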
**C) Split `schema.rs` by responsibility (internal refactor)**
`schema.rs` currently mixes multiple concerns. Refactor into a module layout
that preserves the public API but improves maintainability and testability, for
instance:
* `schema::mod`: schema representation + serde + builder (Avro JSON)
* `schema::store`: schema store, canonical form + Rabin/MD5/SHA256
fingerprints
* `schema::metadata`: Arrow schema metadata keys + embed/extract helpers
(`avro.schema`, `avro.name`, `avro.namespace`, `avro.doc`, defaults/enums)
* `schema::infer`: Arrow -> Avro inference logic (used by the builder)
* `schema::project`: Avro-aware projection/pruning utilities (ties into
#8923)
* `schema::evolve`: Avro-aware evolution/extension utilities (also used by
the builder)
* (optional) `schema::compat` / `schema::resolve`: compatibility checks +
clearer error reporting (path + failure reason)
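One way the split could preserve the public API is to re-export items from the new submodules at the `schema` level. The skeleton below is only a sketch: it uses inline modules instead of separate files, includes just two of the proposed submodules, and `root_name` is a hypothetical helper added to give `infer` something to contain.

```rust
pub mod schema {
    // In the real crate each submodule would live in its own file under `schema/`.
    pub mod metadata {
        pub const SCHEMA_METADATA_KEY: &str = "avro.schema";
        pub const AVRO_NAME_METADATA_KEY: &str = "avro.name";
        pub const AVRO_NAMESPACE_METADATA_KEY: &str = "avro.namespace";
        pub const AVRO_DOC_METADATA_KEY: &str = "avro.doc";
    }

    pub mod infer {
        use super::metadata::AVRO_NAME_METADATA_KEY;
        use std::collections::HashMap;

        /// Hypothetical helper: take the root record name from `avro.name`
        /// metadata, falling back to a caller-supplied default.
        pub fn root_name<'a>(metadata: &'a HashMap<String, String>, fallback: &'a str) -> &'a str {
            metadata
                .get(AVRO_NAME_METADATA_KEY)
                .map(String::as_str)
                .unwrap_or(fallback)
        }
    }

    // Re-exports keep existing paths such as `schema::SCHEMA_METADATA_KEY` resolving.
    pub use self::metadata::{
        AVRO_DOC_METADATA_KEY, AVRO_NAMESPACE_METADATA_KEY, AVRO_NAME_METADATA_KEY,
        SCHEMA_METADATA_KEY,
    };
}
```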
**D) Provide Avro-aware schema projection/evolution primitives**
Centralize Avro schema pruning/projection in `arrow-avro` (rather than
downstream).
This is related to #8923 and would ideally live alongside the refactor so
that the "use real schema", "inference", and "builder" paths can all share
projection and evolution logic.
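A minimal sketch of what a projection primitive might look like follows. The `AvroType` and `AvroField` types here are simplified, hypothetical stand-ins for the crate's schema representation, and `project` handles only top-level fields of a record. The point is that the record's name and the writer's field order survive projection, which is what downstream consumers would otherwise have to reimplement.

```rust
/// Simplified, hypothetical Avro schema model (primitives and records only).
#[derive(Clone, Debug, PartialEq)]
pub enum AvroType {
    Primitive(String),
    Record { name: String, fields: Vec<AvroField> },
}

#[derive(Clone, Debug, PartialEq)]
pub struct AvroField {
    pub name: String,
    pub ty: AvroType,
}

/// Keeps only the top-level fields named in `keep`. The record name and the
/// original field order are preserved; unknown names are reported as errors.
pub fn project(schema: &AvroType, keep: &[&str]) -> Result<AvroType, String> {
    match schema {
        AvroType::Record { name, fields } => {
            // Fail loudly rather than silently dropping a misspelled column.
            if let Some(missing) = keep
                .iter()
                .find(|k| !fields.iter().any(|f| f.name == **k))
            {
                return Err(format!("field `{}` not found in record `{}`", missing, name));
            }
            let kept: Vec<AvroField> = fields
                .iter()
                .filter(|f| keep.iter().any(|k| *k == f.name))
                .cloned()
                .collect();
            Ok(AvroType::Record { name: name.clone(), fields: kept })
        }
        AvroType::Primitive(p) => Err(format!("cannot project non-record type `{}`", p)),
    }
}
```

A real implementation would also need to handle nested records, unions, and named type references, which this sketch leaves out.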
**E) Deprecate `AvroSchema::try_from()`**
Deprecate the existing `AvroSchema::try_from()` method and use
`ArrowToAvroSchemaBuilder::new(&arrow_schema).build()?` in its place. This
shouldn't change any downstream behavior so long as `ArrowToAvroSchemaBuilder`
matches `AvroSchema::try_from()` when no knobs are used.
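The equivalence requirement can be sketched as a deprecated shim that simply delegates to the builder. Everything below is hypothetical (`AvroSchemaSketch`, `BuilderSketch`, `from_root_name`, `legacy_matches_builder`), and the legacy entry point is modeled as an inherent method for simplicity rather than as the real conversion.

```rust
/// Hypothetical stand-in for `AvroSchema`.
#[derive(Debug, PartialEq)]
pub struct AvroSchemaSketch {
    pub json: String,
}

/// Hypothetical stand-in for the proposed builder, reduced to a root name.
pub struct BuilderSketch<'a> {
    root_name: &'a str,
}

impl<'a> BuilderSketch<'a> {
    pub fn new(root_name: &'a str) -> Self {
        Self { root_name }
    }

    pub fn build(self) -> Result<AvroSchemaSketch, String> {
        Ok(AvroSchemaSketch {
            json: format!(
                r#"{{"type":"record","name":"{}","fields":[]}}"#,
                self.root_name
            ),
        })
    }
}

impl AvroSchemaSketch {
    /// Legacy entry point: kept for compatibility, delegating to the builder
    /// so both paths cannot drift apart.
    #[deprecated(note = "use `BuilderSketch::new(..).build()` instead")]
    pub fn from_root_name(root_name: &str) -> Result<Self, String> {
        BuilderSketch::new(root_name).build()
    }
}

/// Regression-style check: the deprecated path and the default builder agree.
#[allow(deprecated)]
pub fn legacy_matches_builder(root_name: &str) -> bool {
    AvroSchemaSketch::from_root_name(root_name) == BuilderSketch::new(root_name).build()
}
```

A check along the lines of `legacy_matches_builder` could be kept as a regression test for as long as the deprecated path exists.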
**Describe alternatives you've considered**
1. **Continue fixing inference bugs incrementally without refactoring**
* Risks continued complexity growth in `schema.rs` and makes it harder to
reason about correctness across reader/writer/projection paths.
2. **Require callers to always provide Avro schema JSON**
* This removes the writer convenience path and doesn’t address
projection/evolution needs or tests where schemas are partially specified via
Arrow metadata.
3. **Downstream projects implement Avro schema editing themselves**
* This duplicates Avro-specific logic and encourages subtle divergences
from `arrow-avro` behavior, especially around naming, metadata, and resolution.
4. **Expose a single `InferOptions` struct instead of a builder**
* Works initially, but becomes less ergonomic as options grow, and makes
it harder to evolve without breaking call-sites. A builder provides a more
extensible surface.
**Additional context**
* Related issues:
* #8928 (nested named type: type name vs field name mismatch when
generating schemas from Arrow-only metadata)
* #8923 (need Avro-aware projection API in `ReaderBuilder` / centralize
schema editing)
* Relevant constants/metadata (current behavior to preserve where possible):
* `SCHEMA_METADATA_KEY = "avro.schema"`
* `AVRO_NAME_METADATA_KEY = "avro.name"`
* `AVRO_NAMESPACE_METADATA_KEY = "avro.namespace"`
* `AVRO_DOC_METADATA_KEY = "avro.doc"`
* `AVRO_FIELD_DEFAULT_METADATA_KEY = "avro.field.default"`
* `AVRO_ENUM_SYMBOLS_METADATA_KEY = "avro.enum.symbols"`
* Avro spec considerations that influence inference defaults:
* Union default values must match the first union branch, which is why
`["null", T]` is typically preferred for optional fields:
[https://avro.apache.org/docs/1.11.1/specification/#unions](https://avro.apache.org/docs/1.11.1/specification/#unions)
* This issue is intentionally large: the goal is to land a design that
solves the schema limitations in `arrow-avro` in the long run. This will need
to be implemented either via sub-issues or smaller partial PRs.