On Sat, Dec 28, 2019 at 11:29 roger peppe <rogpe...@gmail.com> wrote:
> On Sat, 21 Dec 2019, 17:09 Vance Duncan, <dunca...@gmail.com> wrote:
>
>> I suggest naming the timestamp field "timestamp" rather than "time". You
>> might also want to consider calling it "eventTimestamp", since there may
>> be a need to distinguish when the event occurred from when it was
>> actually published, due to delays in batching, intermittent downtime, etc.
>>
>> Also, I suggest considering the addition of traceability metadata, which
>> almost any practical implementation requires. An array of correlation IDs
>> is great for that: it gives publishers and subscribers a way of tracing
>> events back to their external causes. Possibly also an array of
>> "priorEventIds", so that a full tree of traceability can be established
>> post facto.
>
> Your suggestions sound good, but I'm unfortunately not in a position to
> define those things at this time - the existing CloudEvents specification
> already defines names and semantics for those fields (see
> https://github.com/cloudevents/spec/blob/v1.0/spec.md).
>
> I am just trying to define a reasonable way of idiomatically encapsulating
> those existing CloudEvents semantics within the Avro format.
>
> (You might notice that I omitted some fields which are arguably redundant
> when one knows the writer's schema, e.g. data content type and data
> schema.)
>
> cheers,
> rog.
>
>> On Wed, Dec 18, 2019 at 11:49 AM roger peppe <rogpe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Background: I've been contemplating the proposed Avro format in the
>>> CloudEvents specification
>>> <https://github.com/cloudevents/spec/blob/master/avro-format.md>, which
>>> defines standard metadata for events. It defines a very generic format
>>> for an event that allows storage of almost any data. It seems to me
>>> that by going in that direction it loses almost all the advantages of
>>> using Avro in the first place.
>>> It feels like it's trying to shoehorn a dynamic message format like
>>> JSON into Avro, where using Avro itself could do so much better.
>>>
>>> I'm hoping to propose something better. I had what I thought was a
>>> nice idea, but it doesn't *quite* work, and I thought I'd bring up the
>>> subject here and see if anyone had better ideas.
>>>
>>> The schema resolution
>>> <https://avro.apache.org/docs/current/spec.html#Schema+Resolution>
>>> part of the spec allows a reader to read data that was written with a
>>> schema containing extra fields. So, theoretically, we could define a
>>> CloudEvent something like this:
>>>
>>> {
>>>   "name": "CloudEvent",
>>>   "type": "record",
>>>   "fields": [{
>>>     "name": "Metadata",
>>>     "type": {
>>>       "type": "record",
>>>       "name": "CloudEvent",
>>>       "namespace": "avro.apache.org",
>>>       "fields": [
>>>         { "name": "id", "type": "string" },
>>>         { "name": "source", "type": "string" },
>>>         { "name": "time",
>>>           "type": { "type": "long", "logicalType": "timestamp-micros" } }
>>>       ]
>>>     }
>>>   }]
>>> }
>>>
>>> Theoretically, this could enable any event that's a record with *at
>>> least* a Metadata field containing the above fields to be read
>>> generically. The CloudEvent type above could be seen as a structural
>>> supertype of all the more specific CloudEvent-compatible records that
>>> have such a compatible field.
>>>
>>> This has a few nice advantages:
>>>
>>> - there's no need for any wrapping of payload data.
>>> - the CloudEvent type can evolve over time like any other Avro type.
>>> - all the data message fields are immediately available alongside the
>>>   metadata.
>>> - there's still exactly one schema for a topic, encapsulating both the
>>>   metadata and the payload.
>>>
>>> However, this idea fails because of one schema resolution rule: "both
>>> schemas are records with the same (unqualified) name". This means that
>>> unless *everyone* names all their CloudEvent-compatible records
>>> "CloudEvent", they can't be read like this.
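To make the structural-supertype idea concrete, here is a sketch (not from the thread) of what a compatible event record might look like, with a purely structural compatibility check over schemas represented as plain Python dicts. The `UserCreated` record and its `userId`/`email` fields are invented for illustration:

```python
# Metadata fields the generic CloudEvent reader schema expects.
CLOUDEVENT_FIELDS = {"id", "source", "time"}

# A hypothetical concrete event: its payload fields sit directly
# alongside the Metadata field, with no extra wrapping.
user_created = {
    "type": "record",
    "name": "UserCreated",
    "fields": [
        {"name": "Metadata", "type": {
            "type": "record",
            "name": "CloudEvent",
            "namespace": "avro.apache.org",
            "fields": [
                {"name": "id", "type": "string"},
                {"name": "source", "type": "string"},
                {"name": "time",
                 "type": {"type": "long", "logicalType": "timestamp-micros"}},
            ],
        }},
        {"name": "userId", "type": "string"},
        {"name": "email", "type": "string"},
    ],
}

def is_cloudevent_compatible(schema):
    """Structurally check that a record schema carries a Metadata record
    field containing at least the CloudEvent metadata fields."""
    if schema.get("type") != "record":
        return False
    for field in schema.get("fields", []):
        if field["name"] == "Metadata" and isinstance(field["type"], dict):
            names = {f["name"] for f in field["type"].get("fields", [])}
            return CLOUDEVENT_FIELDS <= names
    return False

print(is_cloudevent_compatible(user_created))  # True
```

Note that this check deliberately ignores record names, which is exactly what Avro's schema resolution rules do not do - hence the problem described above.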
>>> I don't think people will be willing to name all their records
>>> "CloudEvent", so we have a problem.
>>>
>>> I can see a few possible workarounds:
>>>
>>> 1. When reading the record as a CloudEvent, read it with a schema
>>>    that's the same as CloudEvent, but with the top-level record name
>>>    changed to the top-level name of the schema that was used to write
>>>    the record.
>>> 2. Ignore record names when matching schema record types.
>>> 3. Allow aliases to be specified when writing data as well as when
>>>    reading it. When defining a CloudEvent-compatible event, you'd add
>>>    a CloudEvent alias to your record.
>>>
>>> None of the options is particularly nice. 1 is probably the easiest to
>>> do, although it means you'd still need some custom logic when decoding
>>> records, so you couldn't use stock decoders.
>>>
>>> I like the idea of 2, although it gets a bit tricky when dealing with
>>> union types. You could define the matching such that it ignores names
>>> only when the match is unambiguous (i.e. only one record on each
>>> side). This could be implemented as an option ("use structural
>>> typing") when decoding.
>>>
>>> 3 is probably the cleanest, but it interacts significantly with the
>>> spec (for example, the canonical schema transformation strips aliases
>>> out, but they'd need to be retained).
>>>
>>> Any thoughts? Is this a silly thing to be contemplating? Is there a
>>> better way?
>>>
>>> cheers,
>>> rog.
>>
>> --
>> Regards,
>>
>> Vance Duncan
>> mailto:dunca...@gmail.com
>> http://www.linkedin.com/in/VanceDuncan
>> (904) 553-5582
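Workaround 1 above can be sketched as a small schema-rewriting step: derive, per writer schema, a reader schema that is the generic CloudEvent schema with its top-level name (and namespace) taken from the writer's, so the "same unqualified name" rule is satisfied. This is an illustrative sketch over dict-encoded schemas; a real implementation would feed the derived schema into a standard Avro decoder as the reader schema:

```python
import copy

# The generic CloudEvent reader schema from the proposal.
CLOUDEVENT_READER = {
    "type": "record",
    "name": "CloudEvent",
    "fields": [
        {"name": "Metadata", "type": {
            "type": "record",
            "name": "CloudEvent",
            "namespace": "avro.apache.org",
            "fields": [
                {"name": "id", "type": "string"},
                {"name": "source", "type": "string"},
                {"name": "time",
                 "type": {"type": "long", "logicalType": "timestamp-micros"}},
            ],
        }},
    ],
}

def reader_schema_for(writer_schema):
    """Return a copy of the CloudEvent reader schema whose top-level
    record name and namespace match the writer's schema, so schema
    resolution's name-matching rule passes."""
    reader = copy.deepcopy(CLOUDEVENT_READER)
    reader["name"] = writer_schema["name"]
    if "namespace" in writer_schema:
        reader["namespace"] = writer_schema["namespace"]
    else:
        reader.pop("namespace", None)
    return reader

# Usage: given some writer schema named "UserCreated" (name invented)...
derived = reader_schema_for(
    {"type": "record", "name": "UserCreated", "fields": []})
print(derived["name"])  # UserCreated
```

This keeps stock schema-resolution machinery intact but, as the thread notes, still requires custom logic before decoding, since the reader schema now varies per writer.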