On 8/4/21 7:21 PM, Tomas Vondra wrote:
> On 8/5/21 1:05 AM, Andres Freund wrote:
>
>>
>>> The first one seems quite efficient in how it encodes the length
>>> into very few bits (which matters especially for small values). It's
>>> designed for integers with 1B, 2B, 4B or 8B, but it can be extended
>>> to arbitrary lengths fairly easily, I think:
>>
>>> Look at the first byte, and
>>>
>>> 0 - 243 - encoded as is
>>> 244 - 1 byte
>>> 245 - 2 bytes
>>> 246 - 3 bytes
>>> 247 - 4 bytes
>>> 248 - 5 bytes
>>> 249 - 6 bytes
>>> 250 - 7 bytes
>>> 251 - 8 bytes
>>> 252 - next 1 byte is length
>>> 253 - next 2 bytes are length
>>> 254 - next 3 bytes are length
>>> 255 - next 4 bytes are length
>>
>>> If we want to support longer lengths, we'd have to reserve an extra
>>> value (which reduces the number of values that require a single byte).
>>
>> I think that's not a bad scheme. I think it may end up being a bit more
>> expensive to decode because you need more branches instead of using
>> find-first-set type instructions.
>>
>
> I don't think it requires many branches, because you can essentially do
>
>     if (byte[0] <= 243)
>         length = 0
>     else if (byte[0] <= 251)
>         length = byte[0] - 243
>     else
>     {
>         length_bytes = byte[0] - 251;
>         ... read length_bytes, decode length
>     }
>
> but I haven't tried implementing it and maybe my intuition is wrong.
>
> Or maybe it'd be a good scheme for on-disk format, but poor for memory.
>
>
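For what it's worth, the branchy decode sketched above is easy enough to
flesh out. A minimal sketch (not tested against any patch in this thread;
it assumes multi-byte values are stored big-endian and leaves the
252-255 "separate length header" cases unhandled, since the thread doesn't
pin down those details):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical decoder for the quoted scheme:
 *   0..243   - the first byte is the value itself
 *   244..251 - (byte0 - 243) following bytes hold the value
 *   252..255 - (byte0 - 251) following bytes hold a length header
 * Byte order for the multi-byte cases is an assumption (big-endian here);
 * the thread does not specify it.
 */
static uint64_t
decode_varint(const uint8_t *buf, size_t *consumed)
{
    uint8_t     b0 = buf[0];

    if (b0 <= 243)
    {
        *consumed = 1;
        return b0;
    }
    else if (b0 <= 251)
    {
        int         nbytes = b0 - 243;  /* 1..8 value bytes follow */
        uint64_t    val = 0;

        for (int i = 0; i < nbytes; i++)
            val = (val << 8) | buf[1 + i];
        *consumed = 1 + (size_t) nbytes;
        return val;
    }

    /* 252..255: next (b0 - 251) bytes are a length header; not handled */
    *consumed = 0;
    return 0;
}
```

So the common small-value path really is a single compare, and the 1-8
byte path is one subtraction plus a short loop that a compiler can
unroll; whether that beats a find-first-set-based format in practice
would need measuring.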
This seems like quite an elegant scheme. Certainly worth trying out. I
find it hard to believe that more than 4 length bytes would be needed
(although that reminds me of the famous and possibly apocryphal quote
"640K ought to be enough for anybody.")

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com