Re: [VOTE] KIP-482: The Kafka Protocol should Support Optional Tagged Fields

Colin McCabe Mon, 30 Sep 2019 09:45:30 -0700

On Sat, Sep 28, 2019, at 17:49, Magnus Edenhill wrote:
> Den mån 23 sep. 2019 kl 14:42 skrev Colin McCabe <cmcc...@apache.org>:
> 
> > On Fri, Sep 20, 2019, at 18:05, Jun Rao wrote:
> > > 101. We already use varInt in the message format. I assume that the
> > > protocol uses the same varInt representation?
> >
> > It uses a slightly different varint representation.  Basically, the
> > difference is that the existing representation uses serpentine encoding to
> > make representing negative numbers more efficient, at the cost of making
> > positive numbers less efficient.  Since tags (and lengths) can't be
> > negative, there is no need for serpentine encoding, and we can be more
> > efficient without it.
> >
> 
> While I don't see anything technically wrong with the proposed custom
> varint encoding, it does
> come at a price since it prevents client developers from using an existing,
> tested, and optimized zigzag varint implementation,
> and it makes the Kafka protocol more complex by now having 4 ways to encode
> integers.
> 
> I'm not strongly opposed, but unless there is an actual efficiency gain in
> using the custom encoding,
> I'd see us using the existing zigzag varint encoding instead, or having the
> protocol semantics state
> that a zero-sized tag value is the same as null, or that omission of a tag
> means null.
> 
> //Magnus
>


Hi Magnus,

I don't really think of this as a custom encoding.  Protobuf supports both 
varints that use zigzag and varints that don't, for example.  The call the 
latter "unsigned varints."  Zigzag encoding reserves half of the bit patterns 
for negative integers, and sometimes you know that something isn't going to be 
negative-- like a length.

There is a big efficiency gain from not using zigzag encoding when we don't 
need negative numbers.  It allows us to encode 2x as many integers before 
moving to using more bytes.  Going from memory without checking the code, it's 
something like we have to start using a two byte encoding when values get 
larger than 63 when using zigzag encoding, but we can go up to 127 without it.

It's should be simple to implement both variations.  Basically the code for 
implementing varints with zigzag encoding is:

>    public static void writeVarint(int value, DataOutput out) throws 
> IOException {
>        writeUnsignedVarint((value << 1) ^ (value >> 31), out);
>    }

You can zigzag encode an int with "(value << 1) ^ (value >> 31)"  And then you 
just do the normal variable-length encoding.  This is not something that needs 
a lot of testing or optimization because it's just a single bit-munging step 
that you either do or don't do, like choosing to negative the int or something. 
 Decoding is similar -- just a simple bit manipulation expression that you 
either do or omit.

We can't make omission of a tag mean null because a lot of datatypes in the 
Kafka protocol just aren't nullable.  int8, int16, int32, int64, UUID, etc.  
Even for datatypes that are nullable, the Kafka protocol lets us choose whether 
null is even an allowed value, let alone the default.  Keep in mind that 
existing fields can be converted to tagged fields in new message versions, so 
we have to work with the existing semantics.

best,
Colin

Re: [VOTE] KIP-482: The Kafka Protocol should Support Optional Tagged Fields

Reply via email to