[PR] Schema parsing and resolved schema rework [avro-rs]

via GitHub Mon, 20 Apr 2026 15:21:49 -0700


KitFieldhouse opened a new pull request, #536:
URL: https://github.com/apache/avro-rs/pull/536


   Hey you all, this PR reworks schema parsing and `ResolvedSchema`.
   
   I apologize for the size of this PR. I tried making it smaller, but I kept 
finding the changes made to parsing/`ResovledSchema` would trickle into many 
other sections of the crate.
   
   This PR addresses issues \#531, \#353, and \#508.
   
   Here is the gist of what this PR does:
   
   # The new stuff
   ## Introduces `SchemaWithSymbols`.
   
   Instead of representing a schema as a single recursive `Schema` variant, 
`SchemaWithSymbols` represents a schema as a hash map of all defined named 
schemata and a hash set of all referenced named schemata from that schema. 
This, coupled with the root schema enum, is enough to represent the entire 
schema parse tree.
   
   The upshot of this representation is that when we go to check if a set of 
schemata form a complete context  we don't need to recurse into each schema and 
can just compare the "symbol tables" directly.
   
   In addition, `SchemaWithSymbols` is essentially just made up of `Arc`'s and 
`HashMaps` of `Arc`'s. This should make cloning fairly quick compared to 
`Schema`.
   
   ## Introduces `ResolvedContext`.
   
   `ResovledContext` wraps a list of schema definitions that have been checked 
for compatibility (no double naming of named schemata, all named references 
have a definition). This is closest to what the original `ResolvedSchema` 
looked like.
   
   ## Completely reworks `ResolvedSchema`.
   
   As can be seen by the diff, `ResolvedSchema` has been essentially replaced 
in its entirety. `ResolvedSchema` now wraps a schema and all of the definitions 
needed to uniquely resolve all named schemata. In addition, we can convert 
`ResolvedSchema` into a `ResolvedNode`.
   
   ## Introduces `ResolvedNode`.
   
   The struct `ResolvedNode` allows other code to walk down a schema tree 
without having to resolve `Schema::Ref` variants themselves. Further, since a 
`ResolvedNode` can only be created from a `ResolvedSchema`, we can make the 
type guarantee that this resolution will always find a unique named schema 
definition. `ResolvedNode` also makes the promise that any default for a record 
field has been checked against the field's schema.
   
   ## Introduces a `Resolver` trait.
   
   This trait is designed to be used to hook up this crate's schema resolution 
logic to an external schema registry. Struct's implementing this trait can be 
provided to `ResolvedContext` either directly or through `ResolvedSchema`. If a 
local definition for a schema can not be found, the `Resolver` is called asking 
for it to either provide that schema or to signal an error.
   
   # API changes
   
   This PR introduces several breaking API changes, both in the public API and 
to the private API. I will try to summarize the changes for both layers.
   
   ## Semantic changes -- parsing
   
   - Parser now always generates a `SchemaWithSymbols`. To keep compatibility 
with existing `Schema` based methods, the original schema parsing methods like 
`Schema::parse_str` will use the parser to generate a `SchemaWithSymbols` which 
is then unraveled into a normal `Schema` enum. This adds some overhead, but 
this is already a "slow" path so I don't think this introduces performance 
concerns.
   
   - `Schema::parse_list` no longer checks if every named schema has a unique 
definition somewhere in the list. This is due to the conceptual shift mentioned 
in issue \#353. In this PR, parser methods do nothing more than parse each 
schema string and check that each one is internally consistent. Checking that a 
list of schemata is consistent (no duplicate named definitions, all references 
resolved) is pushed to `ResolvedSchema`.
   
   ## API changes -- public facing
   
   - `ResolvedSchema` completely reworked. Any code that relied on the old 
`ResolvedSchema` would have to be refactored (though I think the needed changes 
are rather minimal). The new `ResolvedSchema` wraps a single schema and its 
resolved context, and is created via `ResolvedSchema::builder()`. See the new 
builder API: `.additional()` to provide
   extra schemata, then `.build_one()`, `.build_array()`, or `.build_list()` as 
the terminal method depending on how many schemas you need resolved. Variants 
with `_with_resolver` accept a custom `Resolver` implementation for schema 
registry lookups.
   - Elimination of `Schema::parse_str_with_list`. This method parsed a schema 
string alongside a list of other schema strings and returned the parsed schema 
plus the resolved list. The same result can now be achieved by parsing each 
schema independently with `SchemaWithSymbols::parse_str` and then resolving 
them together via 
`ResolvedSchema::builder().additional(others)?.build_one(target)?`.
   - Elimination of `Schema::denormalize`. Denormalization (inlining named type 
definitions into references) is now handled via `ResolvedSchema::unravel()` or 
`SchemaWithSymbols::unravel()`.
   - Elimination of `Schema::independent_canonical_form`. This is now accessed 
through `CompleteSchema::independent_canonical_form()` which requires a 
`ResolvedSchema`, making the resolved-context dependency explicit.
   - `resolve_names` and `resolve_names_with_schemata` (were `pub(crate)`) are 
removed. Their role is subsumed by `ResolvedContext`.
   - `ResolvedOwnedSchema` is removed. `ResolvedSchema` now owns its data (via 
`Arc`), making the separate owned variant unnecessary.
   - `Schema::name()` now returns `Option<&Arc<Name>>` instead of 
`Option<&Name>`.
   - `Value::validate_schemata(schemata)` replaced by 
`Value::validate_against_resolved(resolved_schema)`.
   - `find_schema_with_known_schemata` is removed. Finding a matching element 
in a union requires that you have a `ResolvedSchema`. The method that performs 
union element matching is now `structural_match_on_schema` defined on 
`ResolvedUnion`.
   
   ## API changes -- private facing
   
   - `decode_internal` signature changed from `(schema: &Schema, names: 
&HashMap<Name, S>, enclosing_namespace: NamespaceRef, reader: &mut R)` to
   `(node: ResolvedNode, reader: &mut R)`. The `ResolvedNode` carries the 
schema, its resolved definitions, and handles namespace tracking internally
   - `encode_internal` similarly changed from `(value: &Value, schema: &Schema, 
names: &HashMap<Name, S>, enclosing_namespace: NamespaceRef, writer: &mut W)`
   to `(value: &Value, node: ResolvedNode, writer: &mut W)`.
   - New functions `decode_complete` and `encode_complete` are added as the 
preferred entry points for decoding/encoding with a `ResolvedSchema`.
   - `Block` now takes a `ResolvedSchema` instead of `Names`.
   - `validate_internal` and `resolve_internal` on `Value` changed from generic 
`<S: Borrow<Schema> + Debug>` with `(schema, names, namespace)` parameters to 
taking a `ResolvedNode` directly.
   - The generic `S: Borrow<Schema>` pattern used throughout decode, encode, 
and types for referencing named schemas is entirely replaced by 
`ResolvedNode`'s transparent reference resolution.
   - Elimination of `Schema::parse_with_names` (was `pub(crate)`). Internal 
callers now use `SchemaWithSymbols` and `ResolvedSchema` instead of passing 
`Names` maps through parsing.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] Schema parsing and resolved schema rework [avro-rs]

Reply via email to