Thanks for the reply, Svante! What causes the schema normalization to be incomplete? And is that a problem? As long as the reader can get the schema, duplicates shouldn't matter, provided the differences between the duplicates don't affect decoding.
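To illustrate the hashing concern, here's a small sketch (Python, stdlib only; the `normalized_fingerprint` helper is a naive stand-in I made up, not Avro's actual normalization rules): two schema documents that are logically identical can produce different MD5 fingerprints unless they're normalized first.

```python
import hashlib
import json

# Two logically identical schemas that differ only in whitespace and
# attribute order. (Illustrative JSON handling, not Avro's official
# canonical form.)
schema_a = '{"type": "record", "name": "User", "fields": [{"name": "id", "type": "long"}]}'
schema_b = '{"name":"User","fields":[{"type":"long","name":"id"}],"type":"record"}'

def fingerprint(schema_json: str) -> str:
    """MD5 over the raw schema text."""
    return hashlib.md5(schema_json.encode("utf-8")).hexdigest()

def normalized_fingerprint(schema_json: str) -> str:
    """MD5 over a normalized form: parsed JSON re-serialized with sorted
    keys and no whitespace. A naive stand-in for real canonicalization."""
    parsed = json.loads(schema_json)
    canonical = json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# Raw hashes differ even though the schemas mean the same thing:
assert fingerprint(schema_a) != fingerprint(schema_b)
# After normalization they agree:
assert normalized_fingerprint(schema_a) == normalized_fingerprint(schema_b)
```

So if the normalization step misses cases like this, two writers using the "same" schema would register it under two different fingerprints, which is the incompleteness you're describing.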
Would it make sense to have a spec for how to encode these messages? Maybe
<<fingerprint_type>> <<fingerprint>> <<data>>? Maybe leave room for a
metadata map as well?

On Thu, Jul 9, 2015 at 12:54 PM Svante Karlsson <svante.karls...@csi.se> wrote:

> I had the same problem a while ago, and for the same reasons as you mention
> we decided to use fingerprints (MD5 hashes of the schema). However, there
> are some catches here.
>
> First, I believe that the normalisation of the schema is incomplete, so you
> might end up with different hashes of the same schema.
>
> Second, a 128-bit integer prepended to both keys and values takes more
> space than a 32-bit one. Not a big issue for values, but for keys this
> doubles our size.
>
> Third, we had already started to use Confluent's registry as well, because
> of its existing integration with other pieces of infrastructure (Camus,
> Bottled Water, etc.).
>
> What would be useful from this perspective is a byte or two prepended to
> the schema id, defining the registry namespace.
>
> I've added the fingerprint schema registry as an example in the C++ Kafka
> library at
>
> https://github.com/bitbouncer/csi-kafka/tree/master/examples/schema-registry
>
> We run a couple of those in a Mesos cluster and use HAProxy to find them.
>
> /svante
>
> 2015-07-09 10:36 GMT+02:00 Daniel Schierbeck <daniel.schierb...@gmail.com>:
>
>> I'm working on a system that will store Avro-encoded messages in Kafka.
>> The system will have both producers and consumers in different languages,
>> including Ruby (not JRuby) and Java.
>>
>> At the moment I'm encoding each message as a data file, which means that
>> the full schema is included in each encoded message. This is obviously
>> suboptimal, but it doesn't seem like there's a standardized format for
>> single-message Avro encodings.
>>
>> I've reviewed Confluent's Schema Registry offering, but that seems to be
>> overkill for my needs, and would require me to run and maintain yet
>> another piece of infrastructure. Ideally, I wouldn't have to use anything
>> besides Kafka.
>>
>> Is this something that other people have experience with?
>>
>> I've come up with a scheme that would seem to work well independently of
>> what kind of infrastructure you're using: whenever a writer process is
>> asked to encode a message m with schema s for the first time, it
>> broadcasts (s', s) to a schema registry, where s' is the fingerprint of
>> s. The schema registry in this case can be pluggable, and can be any
>> mechanism that allows different processes to access the schemas. The
>> writer then encodes the message as (s', m), i.e. only includes the schema
>> fingerprint. A reader, when first encountering a message with a schema
>> fingerprint s', looks up s from the schema registry and uses s to decode
>> the message.
>>
>> Here, the concept of a schema registry has been abstracted away and is
>> not tied to the concept of "schema ids" and versions. Furthermore, there
>> are some desirable traits:
>>
>> 1. Schemas are identified by their fingerprints, so there's no need for
>> an external system to issue schema ids.
>> 2. Writing (s', s) pairs is idempotent, so there's no need to coordinate
>> that task. If you've got a system with many writers, you can let all of
>> them broadcast their schemas when they boot, or when they need to encode
>> data using the schemas.
>> 3. It would work with a range of different backends for the schema
>> registry. Simple key-value stores would obviously work, but for my case
>> I'd probably want to use Kafka itself. If the schemas are written to a
>> topic with key-based compaction, where s' is the message key and s is the
>> message value, then Kafka would automatically clean up duplicates over
>> time. This would save me from having to add more pieces to my
>> infrastructure.
>>
>> Has this problem been solved already? If not, would it make sense to
>> define a common "message format" that defines the structure of (s', m)
>> pairs?
>>
>> Cheers,
>> Daniel Schierbeck
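For concreteness, here's a minimal sketch of the <<fingerprint_type>> <<fingerprint>> <<data>> framing with a pluggable registry, in Python. Everything here is illustrative: the `FP_TYPE_MD5` value, the function names, and the dict-backed registry are all my own placeholders, and the payload is treated as opaque bytes (actual Avro binary encoding is out of scope).

```python
import hashlib
import struct

FP_TYPE_MD5 = 0x01  # illustrative fingerprint-type byte, not a standard

# A pluggable registry: anything that maps fingerprint -> schema text.
# Here a plain dict; a compacted Kafka topic keyed by s' would behave the
# same way, since registering (s', s) is idempotent.
registry = {}

def register(schema: bytes) -> bytes:
    """Broadcast (s', s); returns the fingerprint s'."""
    fp = hashlib.md5(schema).digest()
    registry[fp] = schema  # idempotent: same key, same value
    return fp

def encode(schema: bytes, payload: bytes) -> bytes:
    """Frame a message as <<fingerprint_type>> <<fingerprint>> <<data>>."""
    fp = register(schema)
    return struct.pack("B", FP_TYPE_MD5) + fp + payload

def decode(message: bytes):
    """Split the frame and look the schema up by its fingerprint."""
    fp_type = message[0]
    if fp_type != FP_TYPE_MD5:
        raise ValueError("unknown fingerprint type: %d" % fp_type)
    fp, payload = message[1:17], message[17:]
    schema = registry[fp]  # in practice: fetch from the registry backend
    return schema, payload

schema = b'{"type": "string"}'
msg = encode(schema, b"\x0chello avro")
recovered_schema, payload = decode(msg)
assert recovered_schema == schema
assert payload == b"\x0chello avro"
```

The fingerprint-type byte is what lets readers dispatch on hash algorithm (and, per Svante's suggestion, a byte like this could equally carry a registry-namespace id), at a cost of one byte per message on top of the 16-byte MD5.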