Thanks for the reply, Svante!

What causes the schema normalization to be incomplete? And is that a
problem? As long as the reader can get the schema, it shouldn't matter that
there are duplicates – as long as the differences between the duplicates do
not affect decoding.

Would it make sense to have a spec for how to encode these messages? Maybe
<<fingerprint_type>> <<fingerprint>> <<data>> ? Maybe leave room for a
metadata map as well?

On Thu, Jul 9, 2015 at 12:54 PM Svante Karlsson <svante.karls...@csi.se>
wrote:

> I had the same problem a while ago and for the same reasons as you mention
> we decided to use fingerprints (MD5 hash of the schema), however there are
> some catches here.
>
> First I believe that the normalisation of the schema is incomplete so you
> might end up with different hashes of the same schema.
>
> Second, using a 128 bit integer prepended to both key and values takes
> more space than using 32 bit. Not a big issue for values but for keys this
> doubles our size.
>
> Third, we already started to use confluent's registry as well because of
> the already existing integration with other pieces of infrastructure.
> (camus, bottledwater etc.)
>
> What should be useful given this perspective is a byte or two prepending
> the schema id - defining the registry namespace.
>
> I've added the fingerprint schema registry as a example in the c++ kafka
> library at
>
> https://github.com/bitbouncer/csi-kafka/tree/master/examples/schema-registry
>
>
> We run a couple of those in a mesos cluster and use HAproxy find them.
>
>
> /svante
>
>
> 2015-07-09 10:36 GMT+02:00 Daniel Schierbeck <daniel.schierb...@gmail.com>
> :
>
>> I'm working on a system that will store Avro-encoded messages in Kafka.
>> The system will have both producers and consumers in different languages,
>> including Ruby (not JRuby) and Java.
>>
>> At the moment I'm encoding each message as a data file, which means that
>> the full schema is included in each encoded message. This is obviously
>> suboptimal, but it doesn't seem like there's a standardized format for
>> single-message Avro encodings.
>>
>> I've reviewed Confluent's schema-registry offering, but that seems to be
>> overkill for my needs, and would require me to run and maintain yet another
>> piece of infrastructure. Ideally, I wouldn't have to use anything besides
>> Kafka.
>>
>> Is this something that other people have experience with?
>>
>> I've come up with a scheme that would seem to work well independently of
>> what kind of infrastructure you're using: whenever a writer process is
>> asked to encode a message m with schema s for the first time, it broadcasts
>> (s', s) to a schema registry, where s' is the fingerprint of s. The schema
>> registry in this case can be pluggable, and can be any mechanism that
>> allows different processes to access the schemas. The writer then encodes
>> the message as (s', m), i.e. only includes the schema fingerprint. A
>> reader, when first encountering a message with a schema fingerprint s',
>> looks up s from the schema registry and uses s to decode the message.
>>
>> Here, the concept of a schema registry has been abstracted away and is
>> not tied to the concept of "schema ids" and versions. Furthermore, there
>> are some desirable traits:
>>
>> 1. Schemas are identified by their fingerprints, so there's no need for
>> an external system to issue schema ids.
>> 2. Writing (s', s) pairs is idempotent, so there's no need to coordinate
>> that task. If you've got a system with many writers, you can let all of
>> them broadcast their schemas when they boot or when they need to encode
>> data using the schemas.
>> 3. It would work using a range of different backends for the schema
>> registry. Simple key-value stores would obviously work, but for my case I'd
>> probably want to use Kafka itself. If the schemas are writting to a topic
>> with key-based compaction, where s' is the message key and s is the message
>> value, then Kafka would automatically clean up duplicates over time. This
>> would save me from having to add more pieces to my infrastructure.
>>
>> Has this problem been solved already? If not, would it make sense to
>> define a common "message format" that defined the structure of (s', m)
>> pairs?
>>
>> Cheers,
>> Daniel Schierbeck
>>
>
>

Reply via email to