KitFieldhouse opened a new issue, #383:
URL: https://github.com/apache/avro-rs/issues/383

   Hi all! 
   
   While working on issue https://github.com/apache/avro-rs/issues/353, I have 
also bumped into some possible issues with how we currently perform 
writer-to-reader schema resolution. 
   
   There are two issues here:
   
   - Resolution coercions in this implementation that are not in the 
specification.
   - Not using type names in resolution.
   
   Currently, in my reading of the code, writer-to-reader schema resolution is 
performed by first deserializing the buffer into a `Value` using the writer 
schema, and then calling `resolve` (or `resolve_schemata`) on that value with 
the reader schema. I think the fact that we perform schema resolution on the 
value alone, without any reference to the original writer schema, is central 
to both of these issues.
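   
   For context, here is a rough sketch of what that two-step flow amounts to in 
terms of the crate's public API (the object-container `Reader` does the 
equivalent internally); the function name here is made up for illustration:
   
   ```rust
   use std::io::Cursor;

   use apache_avro::{from_avro_datum, types::Value, Schema};

   // Rough sketch of the current two-step flow, for illustration only.
   fn read_with_resolution(
       writer_schema: &Schema,
       reader_schema: &Schema,
       encoded_datum: &[u8],
   ) -> Result<Value, apache_avro::Error> {
       // Step 1: decode the buffer using only the writer schema.
       let value = from_avro_datum(writer_schema, &mut Cursor::new(encoded_datum), None)?;
       // Step 2: resolve the decoded `Value` against the reader schema,
       // with no further reference to the writer schema.
       value.resolve(reader_schema)
   }
   ```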
   
   ## Resolution coercions not in the specification
   Currently, we are allowed to resolve a `Value::Map` against a 
`Schema::Record` and a `Value::Long` against a `Schema::Int`. Neither coercion 
appears in the specification's resolution rules, and the latter causes buggy 
behavior, as the following test shows.
   
   ```rust
    use std::{cell::RefCell, rc::Rc};

    use apache_avro::{types::Value, Reader, Schema, Writer};

    // `init`, `TestResult` and `TestBuffer` are helpers from the surrounding
    // test setup: logging initialization, an error-boxing result alias, and a
    // cloneable in-memory read/write buffer.
    #[test]
    fn test_long_to_int() -> TestResult {
        init();
        let writer_schema = r#"{
            "type": "record",
            "name": "ASchema",
            "fields": [
                {
                    "name": "aNumber",
                    "type": "long"
                }
            ]
        }"#;

        let reader_schema = r#"{
            "type": "record",
            "name": "ASchema",
            "fields": [
                {
                    "name": "aNumber",
                    "type": "int"
                }
            ]
        }"#;

        // value that was written with writer_schema; 300_000_000_000 does not
        // fit in an i32, so long -> int resolution must not silently succeed
        let value = Value::Record(vec![("aNumber".into(), Value::Long(300_000_000_000))]);

        let buffer = TestBuffer(Rc::new(RefCell::new(Vec::new())), 0);

        let writer_schema = Schema::parse_str(writer_schema)?;

        let mut writer = Writer::new(&writer_schema, buffer.clone())?;

        writer.append(value)?;
        writer.flush()?;

        let reader_schema = Schema::parse_str(reader_schema)?;
        let mut reader = Reader::with_schema(&reader_schema, buffer)?;

        let read_value = reader.next();

        // The schemas are incompatible, so the reader must not produce a value;
        // an error (or nothing at all) is the acceptable outcome.
        if let Some(Ok(v)) = read_value {
            panic!(
                "Reader and writer schema are incompatible, should not be able to deserialize. Deser value: {:?}",
                v
            );
        }
        Ok(())
    }
   ```
   
   Currently, this test fails: the `"aNumber"` field is deserialized as a 
silently overflowed (wrapped) integer instead of the resolution being rejected.
   
   ## Not using type names in resolution
   The specification's rules for schema resolution say that for named types we 
first compare the names, and only if that check succeeds do we continue 
resolving the value recursively. Since `Value` does not carry the names of 
named types (fixed, record, and enum), this check is never performed.
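   
   As a concrete illustration (the schemas below are made up for this example): 
per the specification, these two records should fail to resolve against each 
other because their names differ and no aliases are declared, even though the 
data is structurally identical. With names missing from `Value`, that check 
cannot currently happen.
   
   ```rust
   use apache_avro::Schema;

   // Two structurally identical records with different names and no aliases.
   // Per the spec's resolution rules, a reader using `ReaderRecord` must not
   // accept data written with `WriterRecord`.
   fn name_mismatched_schemas() -> (Schema, Schema) {
       let writer = Schema::parse_str(
           r#"{"type": "record", "name": "WriterRecord",
               "fields": [{"name": "x", "type": "int"}]}"#,
       )
       .unwrap();
       let reader = Schema::parse_str(
           r#"{"type": "record", "name": "ReaderRecord",
               "fields": [{"name": "x", "type": "int"}]}"#,
       )
       .unwrap();
       (writer, reader)
   }
   ```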
   
   ## What should we do
   
   The fix for the long->int coercion is simply to remove that arm from the 
resolution logic we currently have (a sketch of the kind of arm in question 
follows). However, I think the issue actually runs a little deeper.
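   
   For illustration only, this is roughly the shape of the offending arm; it is 
a paraphrase, not the exact avro-rs code:
   
   ```rust
   use apache_avro::types::Value;

   // Paraphrased sketch of the long -> int coercion being objected to: when
   // resolving against an int schema, a Long is accepted and cast with `as i32`,
   // which silently wraps for values outside the i32 range.
   fn resolve_int(value: Value) -> Result<Value, String> {
       match value {
           Value::Int(n) => Ok(Value::Int(n)),
           // Removing this arm would make long -> int resolution an error,
           // matching the specification.
           Value::Long(n) => Ok(Value::Int(n as i32)),
           other => Err(format!("expected int, got {other:?}")),
       }
   }
   ```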
   
   Schema resolution as defined by the specification requires three things: a 
writer schema, a reader schema, and a value written with the writer schema (the 
value is needed to decide which union branch of the reader schema to take, if 
any). 
   
   As of now, we are performing schema resolution on just the value itself. 
Since the value doesn't carry schema-level information such as the names of 
named types, there are schema resolution checks (matching via unqualified name) 
that are impossible with the current structure of `Value`. As a result, our 
resolution is effectively purely structural. However, my reading of the 
resolution rules in the specification indicates that the names of named types 
do matter and are used, so resolution is at least quasi-nominal.
   
   ## Proposed solution
   Reorient `Value` as a flexible, user-facing way of creating Avro values. 
`Value` should carry the minimum amount of information needed to be verified 
against a provided schema, and nothing more. That means:
   
   - Get rid of `Value::Union`. In my opinion, a union is a schema-level 
construct describing the possible types; a concrete value inhabits a specific 
branch, not the union itself.
   - Rename `Value::Enum` to `Value::EnumVariant`. Enum is the type; a variant 
is the value. 
   - Values of named types (record, fixed, and enum) should allow an optional 
name. If no name is provided, resolution is essentially the same structural 
resolution we have now. If a name is provided, we first match by name and, if 
that succeeds, perform structural resolution. (A rough sketch of such a 
reshaped `Value` follows this list.)
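   
   To make the proposal concrete, here is a rough sketch of what the reshaped 
`Value` could look like; the variant names and exact shapes are only 
illustrative and up for discussion:
   
   ```rust
   // Illustrative only; variant names and shapes are up for discussion.
   pub enum Value {
       Null,
       Boolean(bool),
       Int(i32),
       Long(i64),
       // ... other primitive and logical-type variants elided ...

       /// Enum is the type, the variant is the value; the enum's name is optional.
       EnumVariant {
           type_name: Option<String>,
           symbol: String,
       },
       /// Named types carry an optional name, used for nominal matching when present.
       Record {
           name: Option<String>,
           fields: Vec<(String, Value)>,
       },
       Fixed {
           name: Option<String>,
           bytes: Vec<u8>,
       },
       // Note: no `Union` variant; a concrete value inhabits a specific branch,
       // not the union itself.
   }
   ```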
   
   To fulfill the semantic role of a value *of* a given schema, we should 
create a new type, `TypedValue`, that wraps both a `Value` and a schema and 
verifies that the provided `Value` is indeed a value that could have been 
written with that schema, performing coercion steps if needed (these coercion 
steps are free to be a superset of the resolution steps listed in the 
specification). In doing so, the wrapped `Value` should be transformed into a 
new type, `CompleteValue`, that carries information drawn from the schema it 
is paired with, such as named-type names and union branch numbers.
   
   Schema resolution is then a transformation from a `TypedValue` of one schema 
to a `TypedValue` of another, performing exactly the resolution steps listed 
in the specification. A minimal sketch of how these pieces could fit together 
is below. 
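   
   The sketch uses the existing `apache_avro::Schema`, `Error`, and current 
`Value` as stand-ins, and elides all of the actual validation, coercion, and 
resolution logic; it is only meant to show the shape of the API:
   
   ```rust
   use apache_avro::{types::Value, Error, Schema};

   /// A value enriched with information drawn from the schema it was checked
   /// against, e.g. fully qualified names and chosen union branch numbers.
   /// (Contents elided in this sketch.)
   pub struct CompleteValue {
       // schema-derived details plus the underlying data
   }

   /// A `Value` that has been verified (and possibly coerced) against a schema.
   pub struct TypedValue<'s> {
       schema: &'s Schema,
       value: CompleteValue,
   }

   impl<'s> TypedValue<'s> {
       /// Check that `value` could have been written with `schema`, applying
       /// coercions where allowed (possibly a superset of the spec's resolution
       /// steps), and record schema-level details in a `CompleteValue`.
       pub fn new(schema: &'s Schema, value: Value) -> Result<Self, Error> {
           // validation and coercion elided
           todo!()
       }

       /// Schema resolution proper: reinterpret this value under `reader`,
       /// applying exactly the resolution steps from the specification.
       pub fn resolve<'r>(self, reader: &'r Schema) -> Result<TypedValue<'r>, Error> {
           // resolution elided
           todo!()
       }
   }
   ```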
   
   ## To discuss 
   
   Since this overlaps with the work from 
https://github.com/apache/avro-rs/issues/353, I was thinking of rolling the 
two up into one eventual PR. Let me know if the above proposal makes sense. 
   
   Go Avro!
   
   
   

