KitFieldhouse opened a new issue, #383: URL: https://github.com/apache/avro-rs/issues/383
Hi all! While working on issue https://github.com/apache/avro-rs/issues/353, I have also bumped into some possible issues with how we currently perform writer-to-reader schema resolution. There are two issues here:

- Resolution coercions in this implementation that are not in the specification.
- Not using type names in resolution.

Currently, in my reading of the code, writer-to-reader schema resolution is performed by first deserializing some buffer into a `Value` using the "writer" schema and then calling `resolve` (or `resolve_schemata`) on this value with the reader schema. I think the fact that we perform schema resolution on the value directly, without reference to the original writer schema, is central to both of these issues.

## Resolution coercions not in the specification

Currently, we are allowed to resolve a `Value::Map` against a `Schema::Record` and to resolve a `Value::Long` against a `Schema::Int`. The latter resolution rule causes buggy behavior, as the following test shows:

```rust
#[test]
fn test_long_to_int() -> TestResult {
    init();

    let writer_schema = r#"{
        "type": "record",
        "name": "ASchema",
        "fields": [
            { "name": "aNumber", "type": "long" }
        ]
    }"#;

    let reader_schema = r#"{
        "type": "record",
        "name": "ASchema",
        "fields": [
            { "name": "aNumber", "type": "int" }
        ]
    }"#;

    // value that was written with writer_schema
    let value = Value::Record(vec![("aNumber".into(), Value::Long(300000000000))]);

    let buffer = TestBuffer(Rc::new(RefCell::new(Vec::new())), 0);

    let writer_schema = Schema::parse_str(writer_schema)?;
    let mut writer = Writer::new(&writer_schema, buffer.clone())?;
    let _ = writer.append(value);
    let _ = writer.flush();

    let reader_schema = Schema::parse_str(reader_schema)?;
    let mut reader = Reader::with_schema(&reader_schema, buffer)?;
    let read_value = reader.next();

    match read_value {
        Some(v) => {
            panic!(
                "Reader and writer schema are incompatible, should not be able to deserialize. Deser value: {:?}",
                v
            )
        }
        None => {}
    }

    Ok(())
}
```

Currently, this test fails, and the value of the `aNumber` field in the reader schema is a silently overflowed integer.

## Not using type names in resolution

The specification's rules for schema resolution indicate that we first compare the names of named types and, if that check succeeds, perform resolution recursively on the value. Since `Value` does not include names for named types (fixed, record, and enum), this check is never performed.

## What should we do

The fix for the long-to-int conversion is simply to remove that arm from the resolution logic we currently have. However, I think this issue actually runs a little deeper. Schema resolution as defined by the specification requires three things: a writer schema, a reader schema, and a value written with the writer schema (the value is needed for resolving which union branch of the reader schema to take, if possible). As of now, we perform schema resolution on just the value itself. Since the value doesn't carry schema-level information such as the names of named types, some schema resolution checks (matching via unqualified name) are impossible with the current structure of `Value`. Our resolution is therefore completely structural. However, my reading of the resolution rules in the specification indicates that the names of named types are important and are used, so resolution is at least quasi-nominal.

## Proposed solution

Reorient `Value` as a flexible, user-facing way of creating Avro values. `Value` should include the minimum amount of information needed to be verified against a provided schema, but nothing more. That means:

- Get rid of `Value::Union`. In my opinion, a union is a schema-level construct describing possible types; a concrete value inhabits a specific branch, not the union itself.
- Rename `Value::Enum` to `Value::EnumVariant`. Enum is the type; variant is the value.
- Values of named types (record, fixed, and enum) should allow an optional name. If no name is provided, resolution is essentially the same structural resolution we have now. If a name is provided, we first perform resolution by name and, if that succeeds, perform structural resolution.

To fulfill the semantic role of a value *of* a given schema, we should create a new type, `TypedValue`, that wraps both a `Value` and a schema and verifies that the provided `Value` is indeed a valid value that could have been written with that schema, performing coercion steps if needed (these coercion steps are free to be a superset of the resolution steps listed in the specification). In doing so, the `Value` should also be transformed into a new type, `CompleteValue`, that includes information from the schema it's wrapped with, such as the names of named types and union branch numbers. Schema resolution then becomes a transformation from a `TypedValue` of one schema to a `TypedValue` of another, performing exactly the resolution steps listed in the specification.

## To discuss

Since this overlaps with the work from https://github.com/apache/avro-rs/issues/353, I was thinking of rolling the two up into one eventual PR. Let me know if the above proposal makes sense.

Go Avro!
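For concreteness, the "silently overflowed integer" described in the first section can be illustrated with a plain Rust cast. This is a standalone sketch of the arithmetic only, not avro-rs code, and makes no claim about how the crate performs the coercion internally:

```rust
fn main() {
    // The value written with the `long` writer schema in the test above:
    let written: i64 = 300_000_000_000;

    // A cast-style coercion to `int` keeps only the low 32 bits
    // (two's complement), so a reader would observe a different number
    // instead of getting a resolution error:
    let wrapped = written as i32;

    assert_ne!(i64::from(wrapped), written);
    println!("{written} silently became {wrapped}");
}
```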
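To make the proposal above concrete, here is a rough sketch of how `TypedValue` could separate verification (where coercions may exceed the spec) from spec-only resolution. Everything below is hypothetical: the cut-down `Schema`/`Value` enums, the `TypedValue` type, and its `new`/`resolve` methods are placeholders for discussion, not existing avro-rs items.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Int,
    Long,
    Record { name: String },
}

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i32),
    Long(i64),
    // Named-type values carry an *optional* name, per the proposal.
    Record { name: Option<String> },
}

/// A `Value` verified as something that could have been written
/// with the wrapped schema.
struct TypedValue<'s> {
    schema: &'s Schema,
    value: Value,
}

impl<'s> TypedValue<'s> {
    /// Verify (and possibly coerce) `value` against `schema`. The
    /// coercions allowed here may be a superset of the spec's rules.
    fn new(schema: &'s Schema, value: Value) -> Result<Self, String> {
        match (schema, value) {
            (Schema::Int, v @ Value::Int(_)) | (Schema::Long, v @ Value::Long(_)) => {
                Ok(Self { schema, value: v })
            }
            (Schema::Record { name }, Value::Record { name: vname }) => match &vname {
                // A name on the value is checked nominally first ...
                Some(v) if v != name => Err(format!("record name mismatch: {v} vs {name}")),
                // ... while no name falls back to structural checking.
                _ => Ok(Self { schema, value: Value::Record { name: vname } }),
            },
            (s, v) => Err(format!("{v:?} is not a value of {s:?}")),
        }
    }

    /// Writer-to-reader resolution applying *exactly* the spec's rules.
    fn resolve(self, reader: &'s Schema) -> Result<TypedValue<'s>, String> {
        match (self.schema, reader, self.value) {
            // `int` is promotable to `long` per the specification ...
            (Schema::Int, Schema::Long, Value::Int(i)) => Ok(TypedValue {
                schema: reader,
                value: Value::Long(i64::from(i)),
            }),
            // ... but `long` -> `int` is not: fail instead of wrapping.
            (Schema::Long, Schema::Int, _) => {
                Err("writer long does not resolve to reader int".into())
            }
            (w, r, value) if w == r => Ok(TypedValue { schema: reader, value }),
            _ => Err("unresolvable".into()),
        }
    }
}

fn main() {
    let (int, long) = (Schema::Int, Schema::Long);

    // The buggy long -> int case becomes a resolution error:
    let written = TypedValue::new(&long, Value::Long(300_000_000_000)).unwrap();
    assert!(written.resolve(&int).is_err());

    // Spec-sanctioned int -> long promotion still works:
    let promoted = TypedValue::new(&int, Value::Int(7)).unwrap().resolve(&long).unwrap();
    assert_eq!(promoted.value, Value::Long(7));

    // Named-type values that carry a name are matched nominally:
    let a = Schema::Record { name: "A".into() };
    assert!(TypedValue::new(&a, Value::Record { name: Some("B".into()) }).is_err());
    assert!(TypedValue::new(&a, Value::Record { name: None }).is_ok());

    println!("ok");
}
```

The design point the sketch tries to capture is that permissive coercions live only in `new`, while `resolve` is free to implement the specification's rules and nothing else.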
