Kriskras99 commented on issue #365:
URL: https://github.com/apache/avro-rs/issues/365#issuecomment-3730627801
Over the past month I've done a lot of thinking about how to improve the
enum support in Avro and I think I finally have something that can work. It's
something that's workable in both the derive code and the general code, and
remains compatible with current schemas.
# Current state of Rust enum support in Avro
## Plain enums
Plain enums are serialized as Avro enums (this is the only kind of enum
supported by `#[derive(AvroSchema)]`):
```rust
pub enum Foo {
    A,
    B,
    C,
}
```
```json
{
  "name": "Foo",
  "type": "enum",
  "symbols": ["A", "B", "C"]
}
```
## Data enums
Data enums are serialized as a record with a discriminator field (Avro enum)
and a value field (Avro union):
```rust
pub struct Bar {
    integer: i32,
}
pub enum Foo {
    A {
        field: String,
    },
    B(Bar)
}
```
```json
{
  "name": "Foo",
  "type": "record",
  "fields": [
    {
      "name": "type",
      "type": {
        "type": "enum",
        "symbols": ["A", "B"]
      }
    },
    {
      "name": "value",
      "type": [
        {
          "type": "record",
          "fields": [
            {
              "name": "field",
              "type": "string"
            }
          ]
        },
        {
          "name": "Bar",
          "type": "record",
          "fields": [
            {
              "name": "integer",
              "type": "int"
            }
          ]
        }
      ]
    }
  ]
}
```
The advantage of this approach is that it works for enums where multiple
variants have the same type. It does not currently work with mixed enums,
as a unit variant will always be encoded as an Avro enum.
## Options
Options have special support in the encoding logic to always produce a bare
union.
```rust
type Foo = Option<String>;
```
```json
[
  "null",
  "string"
]
```
# Alternative representations
These alternative representations support mixed enums.
## Bare union
If all types are unique, then a regular union can be used:
```rust
pub enum Foo {
    A {
        field: String,
    },
    B(Bar),
    C,
}
```
```json
[
  {
    "type": "record",
    "name": "A",
    "fields": [
      {
        "name": "field",
        "type": "string"
      }
    ]
  },
  {
    "name": "Bar",
    "type": "record",
    "fields": [
      {
        "name": "integer",
        "type": "int"
      }
    ]
  },
  "null"
]
```
Using the variant name as the namespace can prevent collisions for named
schema types. This does not work for unnamed types, however.
## Union with a record for every variant
```rust
pub enum Foo {
    A {
        field: String,
    },
    B(Bar),
    C,
}
```
```json
[
  {
    "type": "record",
    "name": "A",
    "fields": [
      {
        "name": "field",
        "type": "string"
      }
    ]
  },
  {
    "type": "record",
    "name": "B",
    "fields": [
      {
        "name": "inner",
        "type": {
          "name": "Bar",
          "type": "record",
          "fields": [
            {
              "name": "integer",
              "type": "int"
            }
          ]
        }
      }
    ]
  },
  {
    "type": "record",
    "name": "C",
    "fields": [
      {
        "name": "inner",
        "type": "null"
      }
    ]
  }
]
This representation always works and, when binary encoded, is just as
efficient as the bare union. However, the schema definition becomes larger,
which inflates both the JSON and the in-memory representation.
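To illustrate why the wrapper records cost nothing on the wire, here is a minimal hand-rolled sketch of the Avro binary encoding for `Foo::B(Bar { integer: 42 })` and `Foo::C` under both schemas. It relies on two facts from the Avro specification: a union is written as the zigzag varint of the branch index followed by the value, and a record is written as its fields back to back with no framing. The helper names (`write_long`, `encode_b_bare`, and so on) are made up for this example and are not part of `apache_avro`.

```rust
/// Appends `n` in Avro's `int`/`long` encoding: zigzag, then base-128 varint.
pub fn write_long(n: i64, out: &mut Vec<u8>) {
    let mut z = ((n << 1) ^ (n >> 63)) as u64;
    while z >= 0x80 {
        out.push(((z & 0x7F) as u8) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

pub struct Bar {
    pub integer: i32,
}

/// A record is just its fields in order; `Bar` has a single `int` field.
pub fn write_bar(bar: &Bar, out: &mut Vec<u8>) {
    write_long(i64::from(bar.integer), out);
}

/// Bare union `[A, Bar, "null"]`: branch index 1, then the `Bar` record.
pub fn encode_b_bare(bar: &Bar) -> Vec<u8> {
    let mut out = Vec::new();
    write_long(1, &mut out);
    write_bar(bar, &mut out);
    out
}

/// Wrapped union `[A, B, C]`: branch index 1, then record `B`, whose only
/// field `inner` is the `Bar` record. The wrapper adds no bytes of its own.
pub fn encode_b_wrapped(bar: &Bar) -> Vec<u8> {
    let mut out = Vec::new();
    write_long(1, &mut out);
    write_bar(bar, &mut out); // field `inner`
    out
}

/// Bare union: branch index 2 selects `"null"`, which encodes as zero bytes.
pub fn encode_c_bare() -> Vec<u8> {
    let mut out = Vec::new();
    write_long(2, &mut out);
    out
}

/// Wrapped union: branch index 2 selects record `C`, whose only field
/// `inner` is `null` and therefore also encodes as zero bytes.
pub fn encode_c_wrapped() -> Vec<u8> {
    let mut out = Vec::new();
    write_long(2, &mut out);
    out
}
```

Both `B` encodings come out as the bytes `0x02 0x54` and both `C` encodings as the single byte `0x04`, so the difference between the two representations is only in the schema, not in the data.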
# Proposal
## Deriving
It would be good to broaden the range of enums we can (de)serialize. I
suggest the following schema derive strategy:
1. If it is a plain enum, emit an Avro enum
2. If it is a mixed/data enum, try to emit an Avro union
3. If that fails, emit an Avro record with a `type` and `value` field
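The three steps above can be sketched roughly as follows. This is only an illustration of the decision logic, not derive-macro code: `Payload`, `Representation` and `choose_representation` are hypothetical names, and the "can it be a union" check is reduced to the Avro rule that a union may contain each unnamed type at most once and each named type at most once per full name. A unit variant is treated as a `"null"` branch, as in the bare union example.

```rust
use std::collections::HashSet;

/// Hypothetical summary of what a single enum variant carries.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Payload {
    /// Unit variant; becomes a `"null"` branch in a union.
    Unit,
    /// Payload with an unnamed schema, identified by its type (e.g. "string").
    Unnamed(&'static str),
    /// Payload with a named schema (record, enum, fixed), identified by full name.
    Named(&'static str),
}

#[derive(Debug, PartialEq, Eq)]
pub enum Representation {
    /// Step 1: plain Avro enum.
    Enum,
    /// Step 2: bare Avro union.
    Union,
    /// Step 3: record with a `type` and a `value` field.
    TaggedRecord,
}

/// Key under which a branch must be unique inside a union.
fn union_key(p: Payload) -> (bool, &'static str) {
    match p {
        Payload::Unit => (false, "null"),
        Payload::Unnamed(ty) => (false, ty),
        Payload::Named(name) => (true, name),
    }
}

pub fn choose_representation(variants: &[Payload]) -> Representation {
    // Step 1: only unit variants means a plain enum.
    if variants.iter().all(|v| *v == Payload::Unit) {
        return Representation::Enum;
    }
    // Step 2: a union works only if every branch is unique.
    let mut seen = HashSet::new();
    if variants.iter().all(|v| seen.insert(union_key(*v))) {
        Representation::Union
    } else {
        // Step 3: fall back to the discriminator record.
        Representation::TaggedRecord
    }
}
```

With this sketch, the mixed `Foo { A { .. }, B(Bar), C }` from above would become a union, while an enum with two `String` variants (or two unit variants next to a data variant) would fall back to the record form.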
## Encoding/decoding
When encoding or decoding, we would just look at the schema to see what is
expected. This does mean changing the signature
of `to_value(value: S) -> Result<Value, Error>` to `to_value(value: S,
schema: &Schema) -> Result<Value, Error>`. We could
also put a bound on `S` that it has to implement `AvroSchema`, but that's
not good performance-wise as the schema cannot be cached.
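As a rough illustration of what "look at the schema" would mean, here is a toy sketch in which the same unit variant ends up as a different value depending on the schema passed in. The `Schema` and `Value` types below are deliberately tiny local stand-ins (loosely modeled on the shape of the real ones) and `unit_variant_to_value` is a hypothetical helper, not the proposed `to_value` itself.

```rust
/// Toy schema: only the two shapes a unit variant can be encoded against.
#[derive(Debug)]
pub enum Schema {
    /// Avro enum with its symbols.
    Enum { symbols: Vec<String> },
    /// Avro union; each branch is identified only by the variant it stands for.
    Union { branches: Vec<String> },
}

/// Toy value type, a local stand-in for the crate's `Value`.
#[derive(Debug, PartialEq)]
pub enum Value {
    Null,
    Enum(u32, String),
    Union(u32, Box<Value>),
}

/// Position of `variant` in `names`, or an error if the schema lacks it.
fn index_of(names: &[String], variant: &str) -> Result<u32, String> {
    names
        .iter()
        .position(|n| n == variant)
        .map(|i| i as u32)
        .ok_or_else(|| format!("variant `{variant}` is not part of the schema"))
}

/// Encodes a unit variant according to whatever the schema expects.
pub fn unit_variant_to_value(variant: &str, schema: &Schema) -> Result<Value, String> {
    match schema {
        // Plain enum schema: emit the symbol and its index.
        Schema::Enum { symbols } => {
            Ok(Value::Enum(index_of(symbols, variant)?, variant.to_string()))
        }
        // Union schema: emit the branch index with a `null` payload.
        Schema::Union { branches } => {
            Ok(Value::Union(index_of(branches, variant)?, Box::new(Value::Null)))
        }
    }
}
```

The point of the sketch is only that the serializer no longer has to guess: `C` becomes `Enum(2, "C")` against an enum schema and `Union(2, Null)` against a union schema.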
P.S. While thinking about this problem, it occurred to me that it's possible
that the `AvroSchema` derive can produce
invalid schemas (fields with the same name because of a `rename`,
`Option<T>` where `T`'s schema is `null`). Users who
are creating a schema by hand (either completely or generating schemas in
code) can of course also have this issue. I
think it would be a good idea to add a `Schema::validate(&self) ->
Result<(), Error>` function so we can validate the
generated schema in the derive, and users can check their own schemas.
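A very rough sketch of what such a check could look like, covering only the two problems mentioned above (duplicate field names and a union that contains `null` twice, which is what `Option<T>` with a `null` schema for `T` would produce). The `Schema` enum here is a simplified stand-in defined just for this example and the error type is a plain `String`; a real `Schema::validate` would work on the crate's own types and would need to check much more.

```rust
use std::collections::HashSet;

/// Simplified stand-in schema type for this example only.
#[derive(Debug, Clone, PartialEq)]
pub enum Schema {
    Null,
    Int,
    String,
    Record {
        name: String,
        fields: Vec<(String, Schema)>,
    },
    Union(Vec<Schema>),
}

impl Schema {
    /// Recursively checks for duplicate field names and duplicate `null` branches.
    pub fn validate(&self) -> Result<(), String> {
        match self {
            Schema::Record { name, fields } => {
                let mut seen = HashSet::new();
                for (field_name, field_schema) in fields {
                    // `insert` returns false if the name was already present.
                    if !seen.insert(field_name.as_str()) {
                        return Err(format!(
                            "record `{name}` has more than one field named `{field_name}`"
                        ));
                    }
                    field_schema.validate()?;
                }
                Ok(())
            }
            Schema::Union(branches) => {
                let nulls = branches
                    .iter()
                    .filter(|b| matches!(b, Schema::Null))
                    .count();
                if nulls > 1 {
                    return Err("union contains `null` more than once".to_string());
                }
                for branch in branches {
                    branch.validate()?;
                }
                Ok(())
            }
            // Primitives are always valid on their own.
            Schema::Null | Schema::Int | Schema::String => Ok(()),
        }
    }
}
```

The derive could call this on the generated schema and turn an `Err` into a compile-time or test-time failure, and users with hand-written schemas could call it directly.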