On Sat, Dec 28, 2019 at 11:29 roger peppe <rogpe...@gmail.com> wrote:
> On Sat, 21 Dec 2019, 17:09 Vance Duncan, <dunca...@gmail.com> wrote:
>
>> I suggest naming the timestamp field "timestamp" rather than "time". You
>> might also want to consider calling it "eventTimestamp", since there may
>> be a need to distinguish when the event occurred from when it was
>> actually published, due to delays in batching, intermittent downtime, etc.
>>
>> Also, I suggest considering the addition of traceability metadata, which
>> almost any practical implementation requires. An array of correlation IDs
>> is great for that: it gives publishers and subscribers a way of tracing
>> events back to their external causes. Possibly also an array of
>> "priorEventIds", so that a full tree of traceability can be established
>> post facto.
>
> Your suggestions sound good, but I'm unfortunately not in a position to
> define those things at this time - the existing CloudEvents specification
> already defines names and semantics for those fields (see
> https://github.com/cloudevents/spec/blob/v1.0/spec.md).
>
> I am just trying to define a reasonable way of idiomatically encapsulating
> those existing CloudEvents semantics within the Avro format.
>
> (You might notice that I omitted some fields which are arguably redundant
> when one knows the writer's schema, e.g. data content type and data
> schema.)
>
> cheers,
> rog.
>
>> On Wed, Dec 18, 2019 at 11:49 AM roger peppe <rogpe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Background: I've been contemplating the proposed Avro format in the
>>> CloudEvents specification
>>> <https://github.com/cloudevents/spec/blob/master/avro-format.md>, which
>>> defines standard metadata for events. It defines a very generic format
>>> for an event that allows storage of almost any data. It seems to me
>>> that by going in that direction it loses almost all the advantages of
>>> using Avro in the first place.
>>> It feels like it's trying to shoehorn a dynamic message format like
>>> JSON into Avro, where using Avro itself could do so much better.
>>>
>>> I'm hoping to propose something better. I had what I thought was a
>>> nice idea, but it doesn't *quite* work, and I thought I'd bring up the
>>> subject here and see if anyone had better ideas.
>>>
>>> The schema resolution
>>> <https://avro.apache.org/docs/current/spec.html#Schema+Resolution>
>>> part of the spec allows a reader to read data that was written with a
>>> schema containing extra fields. So, theoretically, we could define a
>>> CloudEvent something like this:
>>>
>>> {
>>>   "name": "CloudEvent",
>>>   "type": "record",
>>>   "fields": [{
>>>     "name": "Metadata",
>>>     "type": {
>>>       "type": "record",
>>>       "name": "CloudEvent",
>>>       "namespace": "avro.apache.org",
>>>       "fields": [
>>>         { "name": "id", "type": "string" },
>>>         { "name": "source", "type": "string" },
>>>         { "name": "time",
>>>           "type": { "type": "long", "logicalType": "timestamp-micros" } }
>>>       ]
>>>     }
>>>   }]
>>> }
>>>
>>> Theoretically, this could enable any event that's a record with *at
>>> least* a Metadata field containing the above fields to be read
>>> generically. The CloudEvent type above could be seen as a structural
>>> supertype of all the more specific CloudEvent-compatible records that
>>> have such a compatible field.
>>>
>>> This has a few nice advantages:
>>>
>>> - there's no need for any wrapping of payload data.
>>> - the CloudEvent type can evolve over time like any other Avro type.
>>> - all the data message fields are immediately available alongside the
>>>   metadata.
>>> - there's still exactly one schema for a topic, encapsulating both the
>>>   metadata and the payload.
>>>
>>> However, this idea fails because of one schema resolution rule: "both
>>> schemas are records with the same (unqualified) name". This means that
>>> unless *everyone* names all their CloudEvent-compatible records
>>> "CloudEvent", they can't be read like this.
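To make the structural-supertype idea concrete, here is a sketch (not from the thread) of what a compatible event record might look like, with a purely structural compatibility check over schemas represented as plain Python dicts. The `UserCreated` record and its `userId`/`email` fields are invented for illustration:

```python
# Metadata fields the generic CloudEvent reader schema expects.
CLOUDEVENT_FIELDS = {"id", "source", "time"}

# A hypothetical concrete event: its payload fields sit directly
# alongside the Metadata field, with no extra wrapping.
user_created = {
    "type": "record",
    "name": "UserCreated",
    "fields": [
        {"name": "Metadata", "type": {
            "type": "record",
            "name": "CloudEvent",
            "namespace": "avro.apache.org",
            "fields": [
                {"name": "id", "type": "string"},
                {"name": "source", "type": "string"},
                {"name": "time",
                 "type": {"type": "long", "logicalType": "timestamp-micros"}},
            ],
        }},
        {"name": "userId", "type": "string"},
        {"name": "email", "type": "string"},
    ],
}

def is_cloudevent_compatible(schema):
    """Structurally check that a record schema carries a Metadata record
    field containing at least the CloudEvent metadata fields."""
    if schema.get("type") != "record":
        return False
    for field in schema.get("fields", []):
        if field["name"] == "Metadata" and isinstance(field["type"], dict):
            names = {f["name"] for f in field["type"].get("fields", [])}
            return CLOUDEVENT_FIELDS <= names
    return False

print(is_cloudevent_compatible(user_created))  # True
```

Note that this check deliberately ignores record names, which is exactly what Avro's schema resolution rules do not do - hence the problem described above.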
>>> I don't think people will be willing to name all their records
>>> "CloudEvent", so we have a problem.
>>>
>>> I can see a few possible workarounds:
>>>
>>> 1. When reading the record as a CloudEvent, read it with a schema
>>>    that's the same as CloudEvent, but with the top-level record name
>>>    changed to the top-level name of the schema that was used to write
>>>    the record.
>>> 2. Ignore record names when matching schema record types.
>>> 3. Allow aliases to be specified when writing data as well as when
>>>    reading it. When defining a CloudEvent-compatible event, you'd add
>>>    a CloudEvent alias to your record.
>>>
>>> None of the options is particularly nice. 1 is probably the easiest to
>>> do, although it means you'd still need some custom logic when decoding
>>> records, so you couldn't use stock decoders.
>>>
>>> I like the idea of 2, although it gets a bit tricky when dealing with
>>> union types. You could define the matching such that it ignores names
>>> only when the match is unambiguous (i.e. only one record on each
>>> side). This could be implemented as an option ("use structural
>>> typing") when decoding.
>>>
>>> 3 is probably the cleanest, but it interacts significantly with the
>>> spec (for example, the canonical schema transformation strips aliases
>>> out, but they'd need to be retained).
>>>
>>> Any thoughts? Is this a silly thing to be contemplating? Is there a
>>> better way?
>>>
>>> cheers,
>>> rog.
>>
>> --
>> Regards,
>>
>> Vance Duncan
>> mailto:dunca...@gmail.com
>> http://www.linkedin.com/in/VanceDuncan
>> (904) 553-5582
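Workaround 1 above can be sketched as a small schema-rewriting step: derive, per writer schema, a reader schema that is the generic CloudEvent schema with its top-level name (and namespace) taken from the writer's, so the "same unqualified name" rule is satisfied. This is an illustrative sketch over dict-encoded schemas; a real implementation would feed the derived schema into a standard Avro decoder as the reader schema:

```python
import copy

# The generic CloudEvent reader schema from the proposal.
CLOUDEVENT_READER = {
    "type": "record",
    "name": "CloudEvent",
    "fields": [
        {"name": "Metadata", "type": {
            "type": "record",
            "name": "CloudEvent",
            "namespace": "avro.apache.org",
            "fields": [
                {"name": "id", "type": "string"},
                {"name": "source", "type": "string"},
                {"name": "time",
                 "type": {"type": "long", "logicalType": "timestamp-micros"}},
            ],
        }},
    ],
}

def reader_schema_for(writer_schema):
    """Return a copy of the CloudEvent reader schema whose top-level
    record name and namespace match the writer's schema, so schema
    resolution's name-matching rule passes."""
    reader = copy.deepcopy(CLOUDEVENT_READER)
    reader["name"] = writer_schema["name"]
    if "namespace" in writer_schema:
        reader["namespace"] = writer_schema["namespace"]
    else:
        reader.pop("namespace", None)
    return reader

# Usage: given some writer schema named "UserCreated" (name invented)...
derived = reader_schema_for(
    {"type": "record", "name": "UserCreated", "fields": []})
print(derived["name"])  # UserCreated
```

This keeps stock schema-resolution machinery intact but, as the thread notes, still requires custom logic before decoding, since the reader schema now varies per writer.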