We've been happy with the second approach you mentioned. Our usage looks like this (in Avro IDL):

    record Datum {
      Header header;
      Body body;
    }

where Header contains the metadata and Body is specific to the particular application. Something like:

    record Body {
      union { SpecificType1, SpecificType2, ... } body;
    }

One of the nice side effects is that you can take data written with the composite Datum schema and let Avro transform it into what you need by specifying a different reader's schema. (Note: you still have to give Avro *exactly* the schema the data were originally written with, the "writer's schema", for it to be able to parse the Datum records.) So if all you care about is the application-specific part, use the following reader's schema in your parser:

    record HeaderFreeDatum {
      Body body;
    }

Conversely, if you care only about the header bits, use this as the reader's schema in your parser:

    record BodyFreeDatum {
      Header header;
    }

In our use we found a significant speedup when reading just the headers (YMMV). You can also use Avro-generated classes for the BodyFreeDatum that never really change (as long as the Header doesn't change). This lets you revise the schemas for Header and the SpecificTypeX on different schedules.

One final piece of advice: think about how you will handle the inevitable evolution the schemas will undergo.

________________________________________
From: Wai Yip Tung <w...@tungwaiyip.info>
Sent: Tuesday, April 29, 2014 6:14 PM
To: user@avro.apache.org
Subject: embed avro data in an envelop

I am looking for some Avro usage advice. We have created various schemas for different applications, say to represent item id, name, metric, etc. On the other hand, our infrastructure group wants to include some metadata on all messages. This should include things like timestamp, hostname, etc. This metadata is the same for all application messages.

One way to do it is to have a metadata schema that has timestamp, hostname and a binary content field for the application data. This way each message needs to be decoded twice using two schemas.
Another way is to somehow have a composite schema that includes both the metadata and the application-specific data. So each message is decoded just once and automatically includes the needed metadata. I wonder if this can be done and if it is a good idea. Have other people considered similar usage?

Wai Yip