PookieBuns opened a new issue, #449:
URL: https://github.com/apache/avro-rs/issues/449

   Hi maintainers! First of all, I would love to thank you for your work 
bringing apache avro into the Rust ecosystem. I am truly grateful. I have been 
trying to adopt avro more at work, but in its current state, it is quite 
brittle, especially with respect to using serde with Unions that contain a null 
variant. I plan on contributing to fix these issues, but before then I would 
like some input from the maintainers with respect to what we would like to 
support wrt this crate.
   
   # TLDR
   - There are 3 ways to represent avro nullable types in Rust, some work and 
some don't, but I'd like input on which ones we'd like to support
   - For combination nullable unions, the rusty way requires 
`#[serde(untagged)]` to work but it fails on multiple edge cases. Would like 
input on which [plan](#ways-to-address) to address is better direction for this 
crate
   
   # The Rust Representation Conundrum
   
   Given an schema of 
   
   ```
       {
           "name": "MyUnion",
           "type": [
               "null",
               "int"
           ]
       }
   ```
   
   There are currently 3 arguable ways to represent this with Rust types
   
   ## The current `avro-rs` way
   
   ```
   enum MyUnionNullable {
       Null,
       Int(i32),
   }
   ```
   
   ### Advantages
   
   This aligns closest to how avro binary encoding works and is simplest to 
serialize. Use the variant index from serde to properly encode the branch index 
then encode the contained value
   
   ### Disadvantages
   
   This is not "rusty", as null or missing values are generally represented as 
`Option<T>`, and by doing this you lose the nice semantics of `Option`.
   
   This also isn't the proper avro json encoding if you choose to serialize it 
with `serde_json`. 
   
   According to the avro specification,
   
   > The value of a union is encoded in JSON as follows:
   >
   > - if its type is null, then it is encoded as a JSON null;
   > - otherwise it is encoded as a JSON object with one name/value pair whose
     name is the type’s name and whose value is the recursively encoded value.
     For Avro’s named types (record, fixed or enum) the user-specified name is
     used, for other types the type name is used.
   
   Which means that the branch should be encoded as `null` and the int branch 
should be encoded as `{ "int": 1 }`, but in this representation the null branch 
is encoded as `"Null"`.
   
   ## The "Rusty" way
   
   ```
   Option<i32>
   ```
   
   ### Advantages
   
   This is the most intuitive for Rust users
   
   ### Disadvantages
   
   This also isn't the proper avro json encoding if you choose to serialize it 
with `serde_json` because the int branch gets encoded as `1` without the 
discriminator
   
   ## The Anchor On Avro Json Encoding And Serde Json Way
   
   ```
   enum MyUnionAvroJsonEncoding {
       #[serde(rename = "int")]
       Int(i32),
   }
   
   Option<MyUnionAvroJsonEncoding
   ```
   
   ### Advantages
   
   This is technically the most "correct" representation because it serializes 
exactly according to the avro json encoding specification
   
   ### Disadvantages
   
   This is even worse looking than the `avro-rs` way
   
   # Implications
   
   This begs the question of
   
   **What variations of these Rust representations should we support in our 
serializer and deserializer, and is it even possible to support said 
representation?**
   
   As of today, 1 and 2 are fully supported. EXCEPT, for nullable enums, thanks 
to me causing a regression with #337 (Which is unreleased so I guess not so 
bad). Not going to get into the details but basically Enums and Unions are the 
same type in Rust, and with serde you can't tell if you're unit variant is 
serializing an Enum like type or a Union like type. 
   
   IMHO the current plan of only supporting 1 and 2 for 2 variant nullable 
unions is probably sufficient, and avro json encoding should not be done by 
serde json but instead should be exposed by `apache-avro` with its own 
serialization of converting an `apache-avro` value to a `serde_json` value.
   
   # But what about Nullable Combination Unions?
   
   ```
       {
           "name": "MyUnion",
           "type": [
               "null",
               "int",
               {
                   "type": "enum",
                   "name": "MyEnum",
                   "symbols": ["A", "B"]
               },
               {
                   "type": "record",
                   "name": "MyRecord",
                   "fields": [
                       {"name": "a", "type": "int"}
                   ]
               }
           ]
       }
   ```
   There are in my opinion, two ways to represent this in rust
   
   ## The current `avro-rs` way
   
   ```
       enum MyEnum {
           A,
           B,
       }
   
       struct MyRecord {
           a: i32,
       }
   
       enum MyUnionNullable {
           Null,
           Int(i32),
           MyEnum(MyEnum),
           MyRecord(MyRecord),
       }
   ```
   
   ## The "Rusty" way
   ```
       enum MyEnum {
           A,
           B,
       }
   
       struct MyRecord {
           a: i32,
       }
   
       enum MyUnion {
           Int(i32),
           MyEnum(MyEnum),
           MyRecord(MyRecord),
       }
   
       Option<MyUnion>
   ```
   # Implications
   
   My #337 PR was actually aimed to solve this problem. Before said PR, the 
"rusty" way of representing combination unions actually didn't work at all 
since it would immediately pick the first variant of of the schema and try to 
select it. 
   
   HOWEVER, my PR was an mediocre fix at best because in order for it to be 
supported, `#[serde(untagged)]` was required to be decorated on the type, and 
it would break in 2 edge cases.
   
   1. Enums didn't work. This is because of the same problem that I mentioned 
regarding the regression that I created
   2. Because of untagged, deserialization would deserialize to the incorrect 
variant if their fields were identical, or if an earlier variant could "serde" 
deserialize a later variant with its Option semantics, example shown below
   
   ```
       const NULLABLE_RECORD_SCHEMA: &str = r#"
       {
           "name": "MyUnion",
           "type": [
               "null",
               {
                   "type": "record",
                   "name": "MyRecordA",
                   "fields": [
                       {"name": "a", "type": "int"}
                   ]
               },
               {
                   "type": "record",
                   "name": "MyRecordB",
                   "fields": [
                       {"name": "a", "type": "int"}
                   ]
               }
           ]
       }
       "#;
   
       #[derive(Debug, Serialize, Deserialize, PartialEq)]
       struct MyRecordA {
           a: i32,
       }
   
       #[derive(Debug, Serialize, Deserialize, PartialEq)]
       struct MyRecordB {
           a: i32,
       }
   
       #[derive(Debug, Serialize, Deserialize, PartialEq)]
       #[serde(untagged)]
       enum MyUnionUntagged {
           MyRecordA(MyRecordA),
           MyRecordB(MyRecordB),
       }
   ```
   
   In the above example, an avro of variant `MyRecordB` will always deserialize 
to `MyRecordA` because untagged means there's no discriminant and the best 
effort deserialization chooses MyRecordA
   
   ## Ways to address
   
   The goals of my 2 proosals here are to get rid of the reliance on 
#[serde(untagged)] for serialization. I haven't dove into the deserialization 
yet, but rest assured, these plans will not become a PR unless a roundtrip can 
be successful.
   
   1. Lookup Union variants by name rather than index. serde 
serialize_x_variant provides the name of the variant branch you are in. Instead 
of lookup by variant_index, lookup by variant_name to match the schema. Unnamed 
types like primitves in avro will just follow the name as provided by json 
encoding. This change would allow enums to be in whatever order they want, but 
add a restriction that names must strictly follow their primitive name i.e. Int 
String Long. This would be a breaking change since previous users would now 
need to rename their variants to align with the naming convention. Also there 
may be a potential performance hit because we need to scan our union schemas to 
match, although I'm not sure how relevant that is. We could also technically 
only do Lookup by Name if we use the Option<Enum> representation, since we 
would be notified through `serialize_some`
   
   2. Still use index based lookup, BUT make the assumption that if you use 
Option<Enum>, The null variant must be the 0th branch, and properly bump index 
by 1 when you arrive at "serialize_x_variant." However, this would mean that if 
you use Option<Enum>, you are restricted to having a certain schema shape, but 
in the vast majority of cases, null is always the first variant (I recall its 
mentioned in the spec that this is standard practice)
   
   # Final thoughts
   
   `serde` and `avro` don't play the nicest together, but its what we have. My 
goal is for serialization and deserialization to work properly for everyone 
trying to use the crate while trying to follow idiomatic rust as possible. 
Sorry for the long post, but thanks for your attention.
   
   # Appendix
   
   I attached a comprehensive minimal test code that documents the different 
successes and failures of representations of unions with null variant in rust. 
(GH doesn't let me upload an rs file so its in txt)
   
   
[test_nullable_union.txt](https://github.com/user-attachments/files/24978848/test_nullable_union.txt)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to