Hi Roger, have you considered leveraging Avro logical types and keeping the payload and event metadata “separate”?
Here is an example (I'll use Avro IDL, since that is more readable to me :-) ):

record MetaData {
  @logicalType("instant") string timeStamp;
  ….. all the meta data fields...
}

record CloudEvent {
  MetaData metaData;
  Any payload;
}

@logicalType("any")
record Any {
  /**
   * Here you have the schema of the data. For efficiency, you can use a
   * schema id + schema repo, or something like
   * https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences
   */
  string schema;
  bytes data;
}

This way a system that is interested in the metadata does not even have to deserialize the payload.

Hope it helps.

-Z

> On Dec 18, 2019, at 11:49 AM, roger peppe <rogpe...@gmail.com> wrote:
>
> Hi,
>
> Background: I've been contemplating the proposed Avro format in the
> CloudEvent specification
> <https://github.com/cloudevents/spec/blob/master/avro-format.md>, which
> defines standard metadata for events. It defines a very generic format for an
> event that allows storage of almost any data. It seems to me that by going in
> that direction it's losing almost all the advantages of using Avro in the
> first place. It feels like it's trying to shoehorn a dynamic message format
> like JSON into the Avro format, where using Avro itself could do so much
> better.
>
> I'm hoping to propose something better. I had what I thought was a nice idea,
> but it doesn't quite work, and I thought I'd bring up the subject here and
> see if anyone had some better ideas.
>
> The schema resolution
> <https://avro.apache.org/docs/current/spec.html#Schema+Resolution> part of
> the spec allows a reader to read a schema that was written with extra fields.
> So, theoretically, we could define a CloudEvent something like this:
>
> {
>   "name": "CloudEvent",
>   "type": "record",
>   "fields": [{
>     "name": "Metadata",
>     "type": {
>       "type": "record",
>       "name": "CloudEvent",
>       "namespace": "avro.apache.org",
>       "fields": [{
>         "name": "id",
>         "type": "string"
>       }, {
>         "name": "source",
>         "type": "string"
>       }, {
>         "name": "time",
>         "type": {
>           "type": "long",
>           "logicalType": "timestamp-micros"
>         }
>       }]
>     }
>   }]
> }
>
> Theoretically, this could enable any event that's a record that has at least
> a Metadata field with the above fields to be read generically. The CloudEvent
> type above could be seen as a structural supertype of all possible
> more-specific CloudEvent-compatible records that have such a compatible field.
>
> This has a few nice advantages:
> - there's no need for any wrapping of payload data.
> - the CloudEvent type can evolve over time like any other Avro type.
> - all the data message fields are immediately available alongside the
>   metadata.
> - there's still exactly one schema for a topic, encapsulating both the
>   metadata and the payload.
>
> However, this idea fails because of one problem - this schema resolution
> rule: "both schemas are records with the same (unqualified) name". This means
> that unless everyone names all their CloudEvent-compatible records
> "CloudEvent", they can't be read like this.
>
> I don't think people will be willing to name all their records "CloudEvent",
> so we have a problem.
>
> I can see a few possible workarounds:
> 1. when reading the record as a CloudEvent, read it with a schema that's the
>    same as CloudEvent, but with the top level record name changed to the top
>    level name of the schema that was used to write the record.
> 2. ignore record names when matching schema record types.
> 3. allow aliases to be specified when writing data as well as reading it. When
>    defining a CloudEvent-compatible event, you'd add a CloudEvent alias to your
>    record.
>
> None of the options are particularly nice. 1 is probably the easiest to do,
> although it means you'd still need some custom logic when decoding records,
> meaning you couldn't use stock decoders.
>
> I like the idea of 2, although it gets a bit tricky when dealing with union
> types. You could define the matching such that it ignores names only when the
> two matched types are unambiguous (i.e. only one record in both). This could
> be implemented as an option ("use structural typing") when decoding.
>
> 3 is probably cleanest but interacts significantly with the spec (for
> example, the canonical schema transformation strips aliases out, but they'd
> need to be retained).
>
> Any thoughts? Is this a silly thing to be contemplating? Is there a better
> way?
>
> cheers,
>   rog.
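
For illustration, here is a minimal sketch of the first workaround in the quoted message: building a reader schema that is identical to the generic CloudEvent schema except that its top-level record name is taken from the writer's schema. It uses only the Python standard library and manipulates the schemas as plain JSON-style dicts. The function name reader_schema_for and the OrderCreated writer schema are hypothetical, and the sketch assumes the writer embeds the Metadata record unchanged. The resulting schema would still have to be passed to an Avro decoder together with the writer schema, which is the custom decoding step the message mentions.

```python
import copy
import json

# Generic CloudEvent reader schema, following the quoted message.
CLOUD_EVENT_SCHEMA = {
    "name": "CloudEvent",
    "type": "record",
    "fields": [{
        "name": "Metadata",
        "type": {
            "type": "record",
            "name": "CloudEvent",
            "namespace": "avro.apache.org",
            "fields": [
                {"name": "id", "type": "string"},
                {"name": "source", "type": "string"},
                {"name": "time",
                 "type": {"type": "long", "logicalType": "timestamp-micros"}},
            ],
        },
    }],
}


def reader_schema_for(writer_schema: dict) -> dict:
    """Copy the CloudEvent schema, renaming its top-level record to match
    the writer's record so the "same (unqualified) name" rule is satisfied.
    (Hypothetical helper, not part of any Avro library.)"""
    reader = copy.deepcopy(CLOUD_EVENT_SCHEMA)
    reader["name"] = writer_schema["name"]
    if "namespace" in writer_schema:
        reader["namespace"] = writer_schema["namespace"]
    return reader


# Hypothetical writer schema: the metadata record plus one payload field.
writer_schema = {
    "name": "OrderCreated",
    "namespace": "com.example",
    "type": "record",
    "fields": [
        {"name": "Metadata", "type": CLOUD_EVENT_SCHEMA["fields"][0]["type"]},
        {"name": "orderId", "type": "string"},
    ],
}

reader = reader_schema_for(writer_schema)
print(json.dumps(reader, indent=2))
```

Because the reader schema lists only the Metadata field, schema resolution would skip the writer's extra orderId field, which is the behaviour the original proposal relies on.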