I suggest naming the timestamp field "timestamp" rather than "time". You
might also consider calling it "eventTimestamp", since you may need to
distinguish when the event occurred from when it was actually published,
due to delays in batching, intermittent downtime, etc.

Also, I suggest adding traceability metadata, which almost any practical
implementation ends up requiring. An array of correlation IDs works well
for that: it gives publishers and subscribers a way to trace events back
to their external causes. Possibly also an array of "priorEventIds", so
that a full traceability tree can be reconstructed after the fact.
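As a sketch, those two fields might look like this in an Avro metadata
record (the record and field names here are just suggestions, not part of
any spec):

```json
{
  "type": "record",
  "name": "EventMetadata",
  "fields": [
    { "name": "eventTimestamp",
      "type": { "type": "long", "logicalType": "timestamp-micros" } },
    { "name": "correlationIds",
      "type": { "type": "array", "items": "string" } },
    { "name": "priorEventIds",
      "type": { "type": "array", "items": "string" } }
  ]
}
```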

On Wed, Dec 18, 2019 at 11:49 AM roger peppe <rogpe...@gmail.com> wrote:

> Hi,
>
> Background: I've been contemplating the proposed Avro format in the CloudEvent
> specification
> <https://github.com/cloudevents/spec/blob/master/avro-format.md>, which
> defines standard metadata for events. It defines a very generic format for
> an event that allows storage of almost any data. It seems to me that by
> going in that direction it's losing almost all the advantages of using Avro
> in the first place. It feels like it's trying to shoehorn a dynamic message
> format like JSON into the Avro format, where using Avro itself could do so
> much better.
>
> I'm hoping to propose something better. I had what I thought was a nice
> idea, but it doesn't *quite* work, and I thought I'd bring up the subject
> here and see if anyone had some better ideas.
>
> The schema resolution
> <https://avro.apache.org/docs/current/spec.html#Schema+Resolution> part
> of the spec allows a reader to read a schema that was written with extra
> fields. So, theoretically, we could define a CloudEvent something like this:
>
> {
>   "name": "CloudEvent",
>   "type": "record",
>   "fields": [
>     {
>       "name": "Metadata",
>       "type": {
>         "type": "record",
>         "name": "CloudEvent",
>         "namespace": "avro.apache.org",
>         "fields": [
>           { "name": "id", "type": "string" },
>           { "name": "source", "type": "string" },
>           { "name": "time", "type": "long", "logicalType": "timestamp-micros" }
>         ]
>       }
>     }
>   ]
> }
>
> Theoretically, this could enable any event record that has *at least* a
> Metadata field with the above fields to be read generically. The
> CloudEvent type above could be seen as a structural supertype of all
> possible more-specific CloudEvent-compatible records that have such a
> compatible field.
>
> This has a few nice advantages:
> - there's no need for any wrapping of payload data.
> - the CloudEvent type can evolve over time like any other Avro type.
> - all the data message fields are immediately available alongside the
> metadata.
> - there's still exactly one schema for a topic, encapsulating both the
> metadata and the payload.
>
> However, this idea fails because of one problem - this schema resolution
> rule: "both schemas are records with the same (unqualified) name". This
> means that unless *everyone* names all their CloudEvent-compatible
> records "CloudEvent", they can't be read like this.
>
> I don't think people will be willing to name all their records
> "CloudEvent", so we have a problem.
>
> I can see a few possible workarounds:
>
>    1. when reading the record as a CloudEvent, read it with a schema
>    that's the same as CloudEvent, but with the top level record name changed
>    to the top level name of the schema that was used to write the record.
>    2. ignore record names when matching schema record types.
>    3. allow aliases to be specified when writing data as well as reading
>    it. When defining a CloudEvent-compatible event, you'd add a CloudEvent
>    alias to your record.
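For what it's worth, workaround 1 can be sketched in a few lines of plain
Python: before handing the schemas to a resolving decoder, rewrite the
CloudEvent reader schema so its top-level record name matches the writer's.
The helper name and the "UserSignedUp" writer schema below are hypothetical,
not part of any stock Avro library:

```python
import json

# The generic CloudEvent reader schema from the email above.
CLOUD_EVENT_READER_SCHEMA = {
    "name": "CloudEvent",
    "type": "record",
    "fields": [
        {"name": "Metadata", "type": {
            "type": "record",
            "name": "CloudEvent",
            "namespace": "avro.apache.org",
            "fields": [
                {"name": "id", "type": "string"},
                {"name": "source", "type": "string"},
                {"name": "time", "type": "long",
                 "logicalType": "timestamp-micros"},
            ],
        }},
    ],
}

def reader_schema_for(writer_schema_json: str) -> dict:
    """Return a copy of the CloudEvent reader schema whose top-level
    record name (and namespace, if present) is copied from the writer
    schema, so the "same unqualified name" resolution rule is satisfied."""
    writer = json.loads(writer_schema_json)
    reader = json.loads(json.dumps(CLOUD_EVENT_READER_SCHEMA))  # deep copy
    reader["name"] = writer["name"]
    if "namespace" in writer:
        reader["namespace"] = writer["namespace"]
    return reader

# Hypothetical writer schema for a specific event type.
writer_json = ('{"type": "record", "name": "UserSignedUp", '
               '"namespace": "example.com", "fields": []}')
resolved = reader_schema_for(writer_json)
print(resolved["name"])  # UserSignedUp
```

The resolved schema could then be passed as the reader's schema to an
ordinary resolving decoder, which is exactly the "custom logic" cost
mentioned above: every consumer needs this renaming step before decoding.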
>
> None of the options are particularly nice. 1 is probably the easiest to
> do, although it means you'd still need some custom logic when decoding
> records, so you couldn't use stock decoders.
>
> I like the idea of 2, although it gets a bit tricky when dealing with
> union types. You could define the matching such that it ignores names only
> when the two matched types are unambiguous (i.e. only one record in both).
> This could be implemented as an option ("use structural typing") when
> decoding.
>
> 3 is probably cleanest but interacts significantly with the spec (for
> example, the canonical schema transformation strips aliases out, but they'd
> need to be retained).
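If option 3 were adopted, a compatible record might declare the alias on
the writer's side, something like the sketch below (again with a
hypothetical "UserSignedUp" record, and assuming the
avro.apache.org.CloudEvent metadata record is defined elsewhere in the
same schema):

```json
{
  "type": "record",
  "name": "UserSignedUp",
  "aliases": ["CloudEvent"],
  "fields": [
    { "name": "Metadata", "type": "avro.apache.org.CloudEvent" },
    { "name": "userId", "type": "string" }
  ]
}
```

Today aliases only take effect on the reader's side, so this would indeed
require a spec change, including keeping aliases through the canonical-form
transformation.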
>
> Any thoughts? Is this a silly thing to be contemplating? Is there a better
> way?
>
>   cheers,
>     rog.
>
>

-- 
Regards,

Vance Duncan
mailto:dunca...@gmail.com
http://www.linkedin.com/in/VanceDuncan
(904) 553-5582
