Hello,

We have an integration project underway with widely varying usage and requirements around serialization formats. While we typically lean towards Avro, we are now considering using multiple serialization formats (Avro, Protobuf, and JSON) to deal with some of the challenges we are facing. I am interested in hearing whether we have any bad assumptions here, or whether others have weighed similar tradeoffs.
In brief: we are building an on-boarding solution that allows customers to upload their data to our platform. The platform has many products associated with it, each of which can subscribe to customer feeds through pipeline services. In terms of technology, we are using HBase for storage and Kafka as the broker between the various producers and consumers, i.e. the pipeline services.

We started with Avro because a) we had prior experience with it, b) its tag-less nature makes it the most efficient option with respect to storage, and c) it has strong usage and community support around Kafka. We have also found the reader/writer schema separation to be a strong fit for ingesting data and pushing it to the pipeline. However, there are several problem areas we have hit with Avro (rough sketches of each are in the P.S. below).

1) The Registry Lookup - The reader/writer relationship, while useful, also has a cost: either the schema must be stored with each event sent through the pipeline, or we must pass along an identifier that can be used to look up and cache the schema. We have no issue doing the latter within the on-boarding component itself, because it is isolated, but we are more hesitant to place the same cost requirements on downstream components in the pipeline. The ideal here would be for Avro to offer a tagged output option that is not as heavy as placing the entire schema in the message.

2) Variable Data Structures - A goal of ours is to easily generate cross-platform libraries to produce and consume messages in the pipeline. For structures we know at build time (e.g. MessageHeader) we can generate code. But we also have schemas with a high degree of variability; for example, we expect the schemas for customer datasources to change often and to vary significantly between customers. There does not seem to be a way to use Specific Records and Generic Records together. In our case, part of the record is Specific (i.e. well known at build time), but it also contains Generic data (variable structures). There is no way to specify a schema that contains a typed generic field (we would like to be able to generate Specific classes that can return a GenericRecord for a field). As a result, we had to store the variable structures as blobs and write platform-specific code to encode/decode them.

3) Avro's JSON - To deal with 2), we attempted to use Avro's JsonEncoder and allow clients to parse the output themselves, but we found the encoding to be Avro-specific (it assumes an Avro JsonDecoder on the other end). For example, unions result in an additional object wrapper with type information (even for simple unions with null). Also, default values are not encoded into the JSON, since those values are derived from the schema.

Our project is still in its early stages and our preference would be to use Avro for everything. But due to 1-3, I believe we are headed for a best-of-breed serialization solution, where we use Avro for ingest/storage, Protobuf for the pipeline/Kafka to eliminate the need for registry lookups, and JSON with our own encoding for the variable data structures.

Thanks,
Shone Sadler
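P.S. A few rough sketches to make the problems above more concrete; the class names, field names, and id values below are made up purely for illustration, not taken from our actual code. For 1), this is roughly the kind of id-based framing we do today inside the on-boarding component, rather than shipping the full writer schema with every event:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class SchemaIdFraming {
    // Prefix the Avro binary payload with a 4-byte schema id; consumers use the
    // id to look up and cache the writer schema instead of receiving it inline.
    // The framing is ours, not something Avro provides out of the box.
    public static byte[] encode(int schemaId, GenericRecord record, Schema writerSchema)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(ByteBuffer.allocate(4).putInt(schemaId).array());
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}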
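For 2), the workaround we ended up with: a well-known envelope that can be generated as a Specific class, with the customer-specific structure carried as an opaque bytes blob encoded separately against its own schema (the envelope fields here are invented for the example):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Envelope {
    // The well-known part is fixed at build time; the variable part is just bytes.
    static final Schema ENVELOPE = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Envelope\",\"fields\":["
        + "{\"name\":\"messageId\",\"type\":\"string\"},"
        + "{\"name\":\"payloadSchemaId\",\"type\":\"int\"},"
        + "{\"name\":\"payload\",\"type\":\"bytes\"}]}");

    public static GenericRecord wrap(String messageId, int payloadSchemaId,
                                     GenericRecord payload, Schema payloadSchema)
            throws IOException {
        // Encode the variable structure against its own (customer) schema...
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(payloadSchema).write(payload, enc);
        enc.flush();

        // ...and carry it as an opaque blob inside the well-known envelope,
        // which is why downstream platforms need their own decode code.
        GenericRecord msg = new GenericData.Record(ENVELOPE);
        msg.put("messageId", messageId);
        msg.put("payloadSchemaId", payloadSchemaId);
        msg.put("payload", ByteBuffer.wrap(out.toByteArray()));
        return msg;
    }
}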
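For 3), the union wrapping we ran into with Avro's JSON encoding: a simple nullable string comes out wrapped in an extra object keyed by the branch type, rather than as a plain JSON value:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class JsonUnionExample {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Example\",\"fields\":["
            + "{\"name\":\"note\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("note", "hello");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
        new GenericDatumWriter<GenericRecord>(schema).write(rec, encoder);
        encoder.flush();

        // Prints {"note":{"string":"hello"}} rather than {"note":"hello"},
        // which plain (non-Avro) JSON consumers do not expect.
        System.out.println(out.toString());
    }
}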
