Hello!  You might be interested in this short discussion on the dev@
mailing list: 
https://lists.apache.org/x/thread.html/dd7a23c303ef045c124050d7eac13356b20551a6a663a79cb8807f41@%3Cdev.avro.apache.org%3E

In short, it appears that the record name is already ignored in
record-to-record matching (at least outside of unions) as an
implementation detail in Java.  I never *did* get around to verifying
the behaviour of the other language implementations, but if this is
what is being done in practice, it's worth clarifying in the
specification.
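
For what it's worth, here's the kind of check I had in mind - a minimal
sketch (the schemas and class name are made up purely for illustration)
that writes a record under one name and then resolves it against a reader
schema with a different unqualified name.  If the record name really is
being ignored by the Java implementation, the read at the end should
succeed rather than fail resolution:

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class RecordNameCheck {
        public static void main(String[] args) throws Exception {
            // Writer schema: some record whose name is not "CloudEvent".
            Schema writer = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"MyEvent\",\"fields\":"
                + "[{\"name\":\"id\",\"type\":\"string\"}]}");
            // Reader schema: same shape, different (unqualified) record name.
            Schema reader = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"CloudEvent\",\"fields\":"
                + "[{\"name\":\"id\",\"type\":\"string\"}]}");

            GenericRecord rec = new GenericData.Record(writer);
            rec.put("id", "abc-123");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
            enc.flush();

            // If record names are ignored during resolution, this read works
            // even though "MyEvent" != "CloudEvent".
            GenericDatumReader<GenericRecord> dr =
                new GenericDatumReader<>(writer, reader);
            System.out.println(dr.read(null,
                DecoderFactory.get().binaryDecoder(out.toByteArray(), null)));
        }
    }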

It does seem like a very pragmatic thing to do, and would help with
the CloudEvents Avro use case.  It would be a nice recipe to share in
the docs: the right way to read an envelope from a custom message when
you don't need the payload.
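
Something like the sketch below is what I mean by that recipe - a
hypothetical OrderCreated event (all the names and schemas here are
invented for illustration) read back through a generic envelope schema,
so that the payload fields are skipped during schema resolution.  Note
that it relies on the name-insensitive record matching discussed above,
since "OrderCreated" != "CloudEvent":

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class EnvelopeRead {
        // The shared metadata record, embedded in both schemas below.
        static final String METADATA =
            "{\"type\":\"record\",\"name\":\"Metadata\",\"namespace\":\"cloudevents\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"source\",\"type\":\"string\"}]}";

        public static void main(String[] args) throws Exception {
            // Writer: a concrete event with the metadata field plus its own payload.
            Schema writer = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"OrderCreated\",\"fields\":["
                + "{\"name\":\"Metadata\",\"type\":" + METADATA + "},"
                + "{\"name\":\"orderId\",\"type\":\"string\"},"
                + "{\"name\":\"amount\",\"type\":\"long\"}]}");
            // Reader: the generic envelope; it only knows about Metadata.
            Schema reader = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"CloudEvent\",\"fields\":["
                + "{\"name\":\"Metadata\",\"type\":" + METADATA + "}]}");

            GenericRecord meta =
                new GenericData.Record(writer.getField("Metadata").schema());
            meta.put("id", "evt-1");
            meta.put("source", "orders");
            GenericRecord event = new GenericData.Record(writer);
            event.put("Metadata", meta);
            event.put("orderId", "o-42");
            event.put("amount", 100L);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writer).write(event, enc);
            enc.flush();

            // Resolving against the envelope schema skips orderId and amount.
            GenericDatumReader<GenericRecord> dr =
                new GenericDatumReader<>(writer, reader);
            GenericRecord envelope = dr.read(null,
                DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
            System.out.println(envelope.get("Metadata"));
        }
    }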

I'm not sure I understand the third strategy, however!  There aren't
any names in binary data when writing - what would the alias do?

(Also, I largely prefer your Avro version with explicitly typed
metadata fields and names as well!)

All my best, Ryan

On Wed, Dec 18, 2019 at 5:49 PM roger peppe <rogpe...@gmail.com> wrote:
>
> Hi,
>
> Background: I've been contemplating the proposed Avro format in the 
> CloudEvent specification, which defines standard metadata for events. It 
> defines a very generic format for an event that allows storage of almost any 
> data. It seems to me that by going in that direction it's losing almost all 
> the advantages of using Avro in the first place. It feels like it's trying to 
> shoehorn a dynamic message format like JSON into the Avro format, where using 
> Avro itself could do so much better.
>
> I'm hoping to propose something better. I had what I thought was a nice idea, 
> but it doesn't quite work, and I thought I'd bring up the subject here and 
> see if anyone had some better ideas.
>
> The schema resolution part of the spec allows a reader to read data that was
> written with a schema containing extra fields. So, theoretically, we could
> define a CloudEvent something like this:
>
> { "name": "CloudEvent", "type": "record", "fields": [{ "name": "Metadata", 
> "type": { "type": "record", "name": "CloudEvent", "namespace": 
> "avro.apache.org", "fields": [{ "name": "id", "type": "string" }, { "name": 
> "source", "type": "string" }, { "name": "time", "type": "long", 
> "logicalType": "timestamp-micros" }] } }] }
>
> Theoretically, this could enable any event record that has at least a
> Metadata field with the above fields to be read generically. The CloudEvent
> type above could be seen as a structural supertype of all possible
> more-specific CloudEvent-compatible records that have such a compatible field.
>
> This has a few nice advantages:
> - there's no need for any wrapping of payload data.
> - the CloudEvent type can evolve over time like any other Avro type.
> - all the data message fields are immediately available alongside the 
> metadata.
> - there's still exactly one schema for a topic, encapsulating both the 
> metadata and the payload.
>
> However, this idea fails because of one problem - this schema resolution 
> rule: "both schemas are records with the same (unqualified) name". This means 
> that unless everyone names all their CloudEvent-compatible records 
> "CloudEvent", they can't be read like this.
>
> I don't think people will be willing to name all their records "CloudEvent", 
> so we have a problem.
>
> I can see a few possible workarounds:
>
> 1. When reading the record as a CloudEvent, read it with a schema that's the
>    same as CloudEvent, but with the top-level record name changed to the
>    top-level name of the schema that was used to write the record.
> 2. Ignore record names when matching schema record types.
> 3. Allow aliases to be specified when writing data as well as reading it.
>    When defining a CloudEvent-compatible event, you'd add a CloudEvent alias
>    to your record.
>
> None of the options are particularly nice. 1 is probably the easiest to do,
> although it means you'd still need some custom logic when decoding records,
> so you couldn't use stock decoders.
>
> I like the idea of 2, although it gets a bit tricky when dealing with union
> types. You could define the matching such that it ignores names only when the
> match between the two types is unambiguous (i.e. only one record in each
> union). This could be implemented as an option ("use structural typing") when
> decoding.
>
> 3 is probably cleanest but interacts significantly with the spec (for 
> example, the canonical schema transformation strips aliases out, but they'd 
> need to be retained).
>
> Any thoughts? Is this a silly thing to be contemplating? Is there a better 
> way?
>
>   cheers,
>     rog.
>
