On Sat, Sep 28, 2019, at 17:49, Magnus Edenhill wrote: > Den mån 23 sep. 2019 kl 14:42 skrev Colin McCabe <cmcc...@apache.org>: > > > On Fri, Sep 20, 2019, at 18:05, Jun Rao wrote: > > > 101. We already use varInt in the message format. I assume that the > > > protocol uses the same varInt representation? > > > > It uses a slightly different varint representation. Basically, the > > difference is that the existing representation uses serpentine encoding to > > make representing negative numbers more efficient, at the cost of making > > positive numbers less efficient. Since tags (and lengths) can't be > > negative, there is no need for serpentine encoding, and we can be more > > efficient without it. > > > > While I don't see anything technically wrong with the proposed custom > varint encoding, it does > come at a price since it prevents client developers from using an existing, > tested, and optimized zigzag varint implementation, > and it makes the Kafka protocol more complex by now having 4 ways to encode > integers. > > I'm not strongly opposed, but unless there is an actual efficiency gain in > using the custom encoding, > I'd see us using the existing zigzag varint encoding instead, or having the > protocol semantics state > that a zero-sized tag value is the same as null, or that omission of a tag > means null. > > //Magnus >
Hi Magnus, I don't really think of this as a custom encoding. Protobuf supports both varints that use zigzag and varints that don't, for example. The call the latter "unsigned varints." Zigzag encoding reserves half of the bit patterns for negative integers, and sometimes you know that something isn't going to be negative-- like a length. There is a big efficiency gain from not using zigzag encoding when we don't need negative numbers. It allows us to encode 2x as many integers before moving to using more bytes. Going from memory without checking the code, it's something like we have to start using a two byte encoding when values get larger than 63 when using zigzag encoding, but we can go up to 127 without it. It's should be simple to implement both variations. Basically the code for implementing varints with zigzag encoding is: > public static void writeVarint(int value, DataOutput out) throws > IOException { > writeUnsignedVarint((value << 1) ^ (value >> 31), out); > } You can zigzag encode an int with "(value << 1) ^ (value >> 31)" And then you just do the normal variable-length encoding. This is not something that needs a lot of testing or optimization because it's just a single bit-munging step that you either do or don't do, like choosing to negative the int or something. Decoding is similar -- just a simple bit manipulation expression that you either do or omit. We can't make omission of a tag mean null because a lot of datatypes in the Kafka protocol just aren't nullable. int8, int16, int32, int64, UUID, etc. Even for datatypes that are nullable, the Kafka protocol lets us choose whether null is even an allowed value, let alone the default. Keep in mind that existing fields can be converted to tagged fields in new message versions, so we have to work with the existing semantics. best, Colin