On Fri, Oct 7, 2016 at 8:45 AM, Jay Kreps <j...@confluent.io> wrote:

> This discussion has come up a number of times and we've always passed.
>

Hopefully this time the arguments will be convincing enough that Kafka can
decide to do something about it.



> One of things that has helped keep Kafka simple is not adding in new
> abstractions and concepts except when the proposal is really elegant and
> makes things simpler.
>

I completely agree that we want things to be simple and elegant.  This is
exactly what headers provide.

Headers are a clean way to extend the system without sacrificing
performance or elegance. They are modular and backwards compatible.


> Consider three use cases for headers:
>
>
>  1. Kafka-scope: We want to add a feature to Kafka that needs a
>    particular field.
>

This is a _great_ use case for Kafka headers.  Having headers means that
features can be optional, and can be rolled out gradually without moving
everybody from one protocol version to another.  Not every client has to
change, and not every broker has to change.

Without headers you need to parse the messages differently.  With headers
you use the same parser.
I assume I don't need to get into how this makes the system extensible
without requiring others to use the same extensions you have.
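To make that concrete, here is a minimal sketch (the feature, header key
number, and function names are all invented for illustration, not anything
in the KIP) of a broker- or client-side feature reading an optional header
and falling back when it's absent, so old and new clients share one parser:

```python
# Hypothetical sketch: a per-record TTL feature carried in an optional header.
TTL_HEADER_KEY = 0x0001  # assumed Kafka-scope key, illustrative only

def effective_ttl_ms(headers, default_ttl_ms=86_400_000):
    raw = headers.get(TTL_HEADER_KEY)
    if raw is None:
        return default_ttl_ms              # header absent: old behavior, same parser
    return int.from_bytes(raw, "big")      # header present: opt-in behavior
```

A client that has never heard of the feature simply never sets the header
and sees no change in behavior.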



>    2. Company-scope: You want to add a header to be shared by everyone in
>    your company.
>

It is completely true that for client-side things you don't need an
architectural header system.  You could just write a wrapper, encapsulate
every message you send, and achieve end-to-end semantics that way.  But
even if this end-to-end wrapper exists, Kafka currently offers no way to
identify the type of a message (which I wish we could change), so we have
to rely on some magic number to identify the type. Once we have that, we
can have a header system.
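For illustration, here is roughly what such a client-side wrapper has to do
today (the magic number and framing are made up, not any real org's format):
every value gets an envelope so consumers can recognize and peel off the
metadata before reaching the payload.

```python
# Hypothetical envelope: magic number, 4-byte metadata length, metadata, payload.
MAGIC = b"\xca\xfe"  # assumed org-wide magic number, illustrative only

def wrap(metadata: bytes, payload: bytes) -> bytes:
    return MAGIC + len(metadata).to_bytes(4, "big") + metadata + payload

def unwrap(value: bytes):
    if not value.startswith(MAGIC):
        return None, value                 # plain, un-enveloped message
    n = int.from_bytes(value[2:6], "big")
    return value[6:6 + n], value[6 + n:]   # (metadata, payload)
```

Note that everything here lives inside the value, invisible to the broker,
which is exactly the limitation being discussed.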

Avro is useful for encoding schema-based systems, but it's not as useful
for modularity and it's not universal. We have a number of use cases that
don't use Avro (and don't want to). They want to send binary data, but from
an org perspective still need some structure added for accounting, tracing,
auditing, security, etc.  Some of this data would also be useful at the
broker side, which is problematic to provide today (say, with a client-side
wrapper the broker can't see into).



>    3. World-wide scope: You are building a third party tool and want to add
>    some kind of header.
>

I understand that you see (3) as a niche case: trying to build a third-party
tool. For us this is about being a good community citizen.  Let's say that
we have a plugin for large-message support. If we wanted to make that
available to the community (as good citizens would), we could make our
header module open source and others could re-use it.  Why re-implement
something?  The same is true if some company decided to write a
geo-location header and we wanted to use it for some mobile product.  At
this point it seems that at least a few organizations are looking for a
plugin system, and it's likely that they'll have similar requirements.  For
example, it's possible many IoT companies would need similar features, or
self-driving cars might need similar features, etc.  Something that would
benefit a community at large even if it didn't benefit all users.  So maybe
LinkedIn wouldn't care about the self-driving-car-style features, but we
could care about the security features being worked on at BBVA.



>  1. A global registry of numeric keys is super super ugly. This seems
>    silly compared to the Avro (or whatever) header solution which gives
> more
>    compact encoding, rich types, etc.
>

This seems like a perfectly reasonable thing to discuss, and I'm in favor
of discussing it.  Avro is problematic here: it implies you know the schema
in advance, and you can't easily compose things. The richness of the types
is a matter of serialization, so that's a moot point. If you really wanted
Avro, you could encode an Avro object inside one of the headers and the
total overhead would be small.

Numeric int keys are used by many network protocols as an efficient way to
identify the type of the data carried. They have proven themselves.

As for keeping a registry, this is a simple thing. We already maintain
multiple "registries": the Kafka ApiKeys and error codes are things we
already maintain.  Not to mention the "registries" of all the config
variables.
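As a rough sketch of what int-keyed headers look like on the wire (the
field widths here are my assumption for illustration, not the KIP's exact
proposed format), the encoding is a simple type-length-value sequence:

```python
import struct

# Hypothetical TLV layout: 4-byte unsigned key, 4-byte length, then the value.
def encode_headers(headers: dict) -> bytes:
    out = bytearray()
    for key, value in headers.items():
        out += struct.pack(">II", key, len(value)) + value
    return bytes(out)

def decode_headers(buf: bytes) -> dict:
    headers, i = {}, 0
    while i < len(buf):
        key, n = struct.unpack_from(">II", buf, i)
        headers[key] = buf[i + 8:i + 8 + n]
        i += 8 + n
    return headers
```

A decoder that doesn't recognize a key can still skip over it, which is
what makes the scheme composable.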



>    2. Using byte arrays for header values means they aren't really
>    interoperable for case (3). E.g. I can't make a UI that displays
> headers,
>    or allow you to set them in config. To work with third party headers,
> the
>    only case I think this really helps, you need the union of all
>    serialization schemes people have used for any tool.
>

Byte arrays are serialized by the plugin in question.  If you don't have
that plugin (or the code to handle that specific header), then you won't
know what the data is.  The same is true today for deserializing a key or a
value from a message.
Having said that, there are generic visualizers for TLV structures (which
is what the proposed headers are); the major network dump tools, tcpdump
and Wireshark, support them.


>    3. For case (2) and (3) your key numbers are going to collide like
>    crazy. I don't think a global registry of magic numbers maintained
> either
>    by word of mouth or checking in changes to kafka source is the right
> thing
>    to do.
>

With the current proposal for numbering there is no collision: (2) and (3)
have different key spaces.  However, it's true that some coordination is
needed if you're going to pull code straight off the web and use it without
configuring it.  Even in that case, you could rely on hashing as a starting
point.  This is perfectly workable.
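As a sketch of the hashing idea (the key-space split used here is my
assumption for illustration, not the proposal's actual layout), a plugin
could derive a deterministic candidate key from a namespaced name, with
explicit configuration only needed to resolve the rare collision:

```python
import zlib

# Hypothetical: reserve the high bit for "world-wide scope" keys and hash
# a namespaced plugin name into that range.
def candidate_key(namespace: str) -> int:
    return 0x8000_0000 | zlib.crc32(namespace.encode("utf-8"))
```

Two deployments of the same open-source plugin would then agree on the key
by default, without word-of-mouth coordination.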


>    4. We are introducing a new serialization primitive which makes fields
>    disappear conditional on the contents of other fields. This breaks the
>    whole serialization/schema system we have today.
>

I'm not sure I understand the comment here. Can you elaborate?

>    5. We're adding a hashmap to each record
>

Are you talking about the wire representation or the programmatic
representation?  On the wire you're just adding some fields, just like the
existing "key" field.  For the programmatic representation, it's true that
you would have a headers structure that looks like a map (though I'm not
sure it needs to be a hash map specifically). This should be no problem.
If you think this is too much overhead (which I assume is your concern),
then you don't have to use headers, and there will be no performance
penalty.



>    6. This proposes making the ProducerRecord and ConsumerRecord mutable
>    and adding setters and getters (which we try to avoid).
>

I'm not sure what you mean by "mutable".  I'm going to assume you mean
that the class has fields that can be changed.  This is a matter of
deciding how to deal with it from an API perspective.  You will need a way
to add headers to a record, but there are ways to do this at construction
time, or it can be done in a parallel or wrapper class.  We can discuss the
details.
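For example, one immutable-friendly shape (the class and method names here
are hypothetical, not the actual Kafka API) would be to take the headers at
construction and have "adding" a header return a new record:

```python
from types import MappingProxyType

# Hypothetical sketch: no setters; headers are fixed at construction time.
class RecordSketch:
    def __init__(self, topic, key, value, headers=None):
        self.topic, self.key, self.value = topic, key, value
        self.headers = MappingProxyType(dict(headers or {}))  # read-only view

    def with_header(self, hkey, hvalue):
        h = dict(self.headers)
        h[hkey] = hvalue
        return RecordSketch(self.topic, self.key, self.value, h)
```

That keeps the record classes free of getters/setters while still giving
interceptor-style code a way to attach headers.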


> For context on LinkedIn: I set up the system there, but it may have changed
> since i left. The header is maintained with the record schemas in the avro
> schema registry and is required for all records. Essentially all messages
> must have a field named "header" of type EventHeader which is itself a
> record schema with a handful of fields (time, host, etc). The header
> follows the same compatibility rules as other avro fields, so it can be
> evolved in a compatible way gradually across apps. Avro is typed and
> doesn't require deserializing the full record to read the header. The
> header information is (timestamp, host, etc) is important and needs to
> propagate into other systems like Hadoop which don't have a concept of
> headers for records, so I doubt it could move out of the value in any case.
> Not allowing teams to chose a data format other than avro was considered a
> feature, not a bug, since the whole point was to be able to share data,
> which doesn't work if every team chooses their own format.
>

We do have a few cases that do not use Avro and would like to keep it that
way.

What is the current way (or the best way, if there are multiple) to enforce
that messages in a topic are Avro (or, for that matter, any other type)?

If you were still here maybe you would also be in favor of headers now :-).

Nacho




> On Thu, Sep 22, 2016 at 12:31 PM, Michael Pearce <michael.pea...@ig.com>
> wrote:
>
> > Hi All,
> >
> >
> > I would like to discuss the following KIP proposal:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 82+-+Add+Record+Headers
> >
> >
> >
> > I have some initial drafts of roughly the changes that would be needed.
> > This is no where finalized and look forward to the discussion especially
> as
> > some bits I'm personally in two minds about.
> >
> > https://github.com/michaelandrepearce/kafka/tree/
> kafka-headers-properties
> >
> >
> >
> > Here is a link to a alternative option mentioned in the kip but one i
> > would personally would discard (disadvantages mentioned in kip)
> >
> > https://github.com/michaelandrepearce/kafka/tree/kafka-headers-full?
> >
> >
> > Thanks
> >
> > Mike
> >
> >
> >
> >
> >
>



-- 
Nacho (Ignacio) Solis
Kafka
nso...@linkedin.com
