On 02/17/2015 09:34 AM, Chris Angelico wrote:
> On Wed, Feb 18, 2015 at 1:12 AM, Dave Angel <da...@davea.name> wrote:
>> They had a field type called a "compressed integer."  It could vary
>> between one byte and, I think, about six.  The theory was that it took
>> less space than the equivalent fixed-size integer format.

> Oh, incidentally: If you want a decent binary format for
> variable-sized integers, check out the MIDI spec. Its varlen integer
> is simple: each byte carries seven bits of payload, and the high bit
> is set on all except the last byte. Anything under 128 is represented
> by the single byte with that value; anything from 128 upward gets a
> series of continuation bytes followed by a final byte. It could scale
> to infinity if you wanted it to, though I think the MIDI spec doesn't
> have anything that can go quite that high. The worst-case wastage is
> 12.5% (or, looking at it the other way, a worst-case expansion of
> about 14.3%) compared to an optimal eight-bit encoding. Given that
> MIDI often works with very small numbers, but occasionally has to
> cope with much larger ones (e.g. "time to next event", which is often
> zero or very low, but once in a while could be anything at all), this
> is a good trade-off. If you know you're frequently going to work with
> numbers up to 2**31 but only occasionally larger than that, it might
> be better to take four bytes at a time and use the high bit as a
> continuation bit in the same way. It's all trade-offs.

I have used that 7-bits-per-byte encoding myself for variable-length integers. It worked well enough for its purpose.
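
For anyone who wants to play with it, here's a minimal Python 3 sketch of that MIDI-style scheme (the function names are mine, purely for illustration):

def encode_vlq(n):
    """Encode a non-negative integer as a MIDI-style variable-length
    quantity: 7 payload bits per byte, most significant group first,
    high bit set on every byte except the last."""
    if n < 0:
        raise ValueError("VLQ encodes non-negative integers only")
    chunks = [n & 0x7F]          # last byte: high bit clear
    n >>= 7
    while n:
        chunks.append((n & 0x7F) | 0x80)   # continuation bit set
        n >>= 7
    return bytes(reversed(chunks))

def decode_vlq(data):
    """Decode a VLQ from the start of `data`; return (value, bytes_used)."""
    value = 0
    for i, byte in enumerate(data):
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:                # high bit clear: last byte
            return value, i + 1
    raise ValueError("truncated VLQ")

assert encode_vlq(0) == b"\x00"
assert encode_vlq(127) == b"\x7f"
assert encode_vlq(128) == b"\x81\x00"
assert decode_vlq(encode_vlq(123456789)) == (123456789, 4)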


> Decimal digits give you roughly 10/3 bits per character (== per byte,
> if you encode as ASCII/UTF-8). If you consider that your baseline,
> you can then try to justify your algorithm on the basis of "it's
> twice as efficient as decimal". Hexadecimal gives you 4 bits per
> character at a relatively small cost; if you can't get better than
> about 6 bits per byte of average throughput, I would say your
> algorithm is completely valueless: the difference will so rarely
> matter, and you lose all the benefits I mentioned in my last email.
>
> I've tried to read through the original algorithm description, but
> I'm not entirely sure: how many payload bits per transmitted byte
> does it actually achieve?
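
To put those baselines side by side, a quick back-of-the-envelope table (the base64 row is my own addition, as the usual reference point for "about 6 bits per byte"):

import math

# Payload bits carried per transmitted byte, assuming one byte per
# character for the textual encodings.
baselines = {
    "decimal digits":       math.log2(10),   # ~3.32
    "hexadecimal digits":   math.log2(16),   #  4.00
    "base64":               math.log2(64),   #  6.00
    "7-bit continuation":   7.0,             # MIDI-style varint payload
    "raw binary (optimal)": 8.0,
}

for name, bits in baselines.items():
    print("{:22s} {:5.2f} bits/byte".format(name, bits))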

Somehow when I first read the message, I missed the fact that there was a link to a web page with source code.

But the first thing I'd expect to see would be an estimate of the anticipated distribution of the number magnitudes. For example, if a typical integer is 1500 bits, plus or minus 200, I'd probably try encoding in base 65535 (two bytes per digit) and use 0xffff as a terminator, since that value can never appear as a digit.
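
Something like the following rough sketch, assuming the numbers really do land in that size range (the names here are made up for illustration):

import struct

BASE = 65535          # digits 0..65534; 0xffff is free to act as terminator
TERMINATOR = 0xFFFF

def encode_base65535(n):
    """Encode a non-negative integer as big-endian base-65535 digits
    (two bytes each), followed by a 0xffff terminator."""
    digits = []
    while True:
        digits.append(n % BASE)
        n //= BASE
        if n == 0:
            break
    digits.reverse()
    return b"".join(struct.pack(">H", d) for d in digits + [TERMINATOR])

def decode_base65535(data):
    """Inverse of encode_base65535; returns (value, bytes_consumed)."""
    value = 0
    for i in range(0, len(data), 2):
        (digit,) = struct.unpack_from(">H", data, i)
        if digit == TERMINATOR:
            return value, i + 2
        value = value * BASE + digit
    raise ValueError("missing terminator")

n = 2**1500                       # roughly the magnitude mentioned above
blob = encode_base65535(n)
assert decode_base65535(blob) == (n, len(blob))

At nearly 16 payload bits per 16-bit digit, the overhead for a 1500-bit number is essentially just the two-byte terminator plus whatever the leading digit doesn't fill.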

Without such a target distribution, it's not a compression scheme, just an encoding one.

An interesting point of history: at the time of Huffman's paper, it was provable that his code was the best possible lossless compression. But there were some unwritten assumptions, such as that each character would be encoded with a whole number of bits; drop that assumption and arithmetic coding becomes possible. The next assumption was that the probability of each character didn't depend on context; dropping that one enables LZ and its relatives. And so on.

--
DaveA
--
https://mail.python.org/mailman/listinfo/python-list
