KitFieldhouse opened a new pull request, #536: URL: https://github.com/apache/avro-rs/pull/536
Hey you all, this PR reworks schema parsing and `ResolvedSchema`. I apologize for the size of this PR. I tried making it smaller, but I kept finding the changes made to parsing/`ResovledSchema` would trickle into many other sections of the crate. This PR addresses issues \#531, \#353, and \#508. Here is the gist of what this PR does: # The new stuff ## Introduces `SchemaWithSymbols`. Instead of representing a schema as a single recursive `Schema` variant, `SchemaWithSymbols` represents a schema as a hash map of all defined named schemata and a hash set of all referenced named schemata from that schema. This, coupled with the root schema enum, is enough to represent the entire schema parse tree. The upshot of this representation is that when we go to check if a set of schemata form a complete context we don't need to recurse into each schema and can just compare the "symbol tables" directly. In addition, `SchemaWithSymbols` is essentially just made up of `Arc`'s and `HashMaps` of `Arc`'s. This should make cloning fairly quick compared to `Schema`. ## Introduces `ResolvedContext`. `ResovledContext` wraps a list of schema definitions that have been checked for compatibility (no double naming of named schemata, all named references have a definition). This is closest to what the original `ResolvedSchema` looked like. ## Completely reworks `ResolvedSchema`. As can be seen by the diff, `ResolvedSchema` has been essentially replaced in its entirety. `ResolvedSchema` now wraps a schema and all of the definitions needed to uniquely resolve all named schemata. In addition, we can convert `ResolvedSchema` into a `ResolvedNode`. ## Introduces `ResolvedNode`. The struct `ResolvedNode` allows other code to walk down a schema tree without having to resolve `Schema::Ref` variants themselves. Further, since a `ResolvedNode` can only be created from a `ResolvedSchema`, we can make the type guarantee that this resolution will always find a unique named schema definition. `ResolvedNode` also makes the promise that any default for a record field has been checked against the field's schema. ## Introduces a `Resolver` trait. This trait is designed to be used to hook up this crate's schema resolution logic to an external schema registry. Struct's implementing this trait can be provided to `ResolvedContext` either directly or through `ResolvedSchema`. If a local definition for a schema can not be found, the `Resolver` is called asking for it to either provide that schema or to signal an error. # API changes This PR introduces several breaking API changes, both in the public API and to the private API. I will try to summarize the changes for both layers. ## Semantic changes -- parsing - Parser now always generates a `SchemaWithSymbols`. To keep compatibility with existing `Schema` based methods, the original schema parsing methods like `Schema::parse_str` will use the parser to generate a `SchemaWithSymbols` which is then unraveled into a normal `Schema` enum. This adds some overhead, but this is already a "slow" path so I don't think this introduces performance concerns. - `Schema::parse_list` no longer checks if every named schema has a unique definition somewhere in the list. This is due to the conceptual shift mentioned in issue \#353. In this PR, parser methods do nothing more than parse each schema string and check that each one is internally consistent. Checking that a list of schemata is consistent (no duplicate named definitions, all references resolved) is pushed to `ResolvedSchema`. ## API changes -- public facing - `ResolvedSchema` completely reworked. Any code that relied on the old `ResolvedSchema` would have to be refactored (though I think the needed changes are rather minimal). The new `ResolvedSchema` wraps a single schema and its resolved context, and is created via `ResolvedSchema::builder()`. See the new builder API: `.additional()` to provide extra schemata, then `.build_one()`, `.build_array()`, or `.build_list()` as the terminal method depending on how many schemas you need resolved. Variants with `_with_resolver` accept a custom `Resolver` implementation for schema registry lookups. - Elimination of `Schema::parse_str_with_list`. This method parsed a schema string alongside a list of other schema strings and returned the parsed schema plus the resolved list. The same result can now be achieved by parsing each schema independently with `SchemaWithSymbols::parse_str` and then resolving them together via `ResolvedSchema::builder().additional(others)?.build_one(target)?`. - Elimination of `Schema::denormalize`. Denormalization (inlining named type definitions into references) is now handled via `ResolvedSchema::unravel()` or `SchemaWithSymbols::unravel()`. - Elimination of `Schema::independent_canonical_form`. This is now accessed through `CompleteSchema::independent_canonical_form()` which requires a `ResolvedSchema`, making the resolved-context dependency explicit. - `resolve_names` and `resolve_names_with_schemata` (were `pub(crate)`) are removed. Their role is subsumed by `ResolvedContext`. - `ResolvedOwnedSchema` is removed. `ResolvedSchema` now owns its data (via `Arc`), making the separate owned variant unnecessary. - `Schema::name()` now returns `Option<&Arc<Name>>` instead of `Option<&Name>`. - `Value::validate_schemata(schemata)` replaced by `Value::validate_against_resolved(resolved_schema)`. - `find_schema_with_known_schemata` is removed. Finding a matching element in a union requires that you have a `ResolvedSchema`. The method that performs union element matching is now `structural_match_on_schema` defined on `ResolvedUnion`. ## API changes -- private facing - `decode_internal` signature changed from `(schema: &Schema, names: &HashMap<Name, S>, enclosing_namespace: NamespaceRef, reader: &mut R)` to `(node: ResolvedNode, reader: &mut R)`. The `ResolvedNode` carries the schema, its resolved definitions, and handles namespace tracking internally - `encode_internal` similarly changed from `(value: &Value, schema: &Schema, names: &HashMap<Name, S>, enclosing_namespace: NamespaceRef, writer: &mut W)` to `(value: &Value, node: ResolvedNode, writer: &mut W)`. - New functions `decode_complete` and `encode_complete` are added as the preferred entry points for decoding/encoding with a `ResolvedSchema`. - `Block` now takes a `ResolvedSchema` instead of `Names`. - `validate_internal` and `resolve_internal` on `Value` changed from generic `<S: Borrow<Schema> + Debug>` with `(schema, names, namespace)` parameters to taking a `ResolvedNode` directly. - The generic `S: Borrow<Schema>` pattern used throughout decode, encode, and types for referencing named schemas is entirely replaced by `ResolvedNode`'s transparent reference resolution. - Elimination of `Schema::parse_with_names` (was `pub(crate)`). Internal callers now use `SchemaWithSymbols` and `ResolvedSchema` instead of passing `Names` maps through parsing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
