On Wed, 25 Oct 2017 07:09 am, Peter J. Holzer wrote:

> On 2017-10-23 04:21, Steve D'Aprano <steve+pyt...@pearwood.info> wrote:
>> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:
>>>
>> If the probability of certain codes (either single codes, or sequences of
>> codes) is non-equal, then you can take advantage of that by encoding the
>> common cases into a short representation, and the uncommon and rare cases
>> into a longer representation. As you say:
>>
>>> Otherwise, if ( 0, 0 ) is much more frequent,
>>> we can encode ( 0, 0 ) by "0" and
>>>
>>> ( 0, 1 ) by "101",
>>> ( 1, 0 ) by "110", and
>>> ( 1, 1 ) by "111".
>>>
>>> And we could then use /less/ than two bits on the
>>> average.
>>
>> That's incorrect. On average you use 2.5 bits.
>>
>> (1*1 bit + 3*3 bits, divided by four possible outcomes, makes 2.5 bits.)
>
> I disagree. If the distribution is not equal, then the average needs to
> take the different probabilities into account.
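The two averages being argued about can be computed side by side. Here is a
small Python sketch of the weighted (expected) code length for the prefix
code above; the skewed probabilities are illustrative values I made up, not
numbers from the thread:

```python
# The prefix code from Stefan's example: pairs of bits -> codewords.
CODE = {"00": "0", "01": "101", "10": "110", "11": "111"}

def expected_bits(probs):
    """Weighted average code length: sum of p(pair) * len(codeword)."""
    return sum(probs[pair] * len(CODE[pair]) for pair in CODE)

# Uniform distribution: this is the unweighted average of 2.5 bits.
uniform = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}
print(expected_bits(uniform))   # 2.5 bits per pair

# A skewed distribution (illustrative) where (0, 0) dominates.
skewed = {"00": 0.85, "01": 0.05, "10": 0.05, "11": 0.05}
print(expected_bits(skewed))    # ~1.3 bits per pair, i.e. less than 2
```

With the uniform distribution the weighted average reduces to the plain
average (2.5 bits); with the skewed one it drops below 2, which is the
point Stefan and Peter were making.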
I think I would call that the *weighted* average rather than the average.
Regardless of what we call it, of course both you and Stefan are right about
how to calculate it, and such a variable-length scheme can be used to
compress the data. E.g. given the sequence 00000011, which would take 8 bits
in the obvious encoding, we can encode it as "000111", which takes only 6
bits.

But the cost of this encoding scheme is that *some* bit sequences expand,
e.g. the 8-bit sequence 11111100 is encoded as "1111111110", which requires
10 bits.

The end result is that, averaged over all possible bit sequences (of a
certain size), this encoding scheme requires MORE space than the obvious
0/1 bits. But in practice we don't care much, because the data sequences we
care about are usually not "all possible bit sequences", but a heavily
restricted subset where there are lots of 00 pairs and fewer 01, 10, and 11
pairs.


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list
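The compression and expansion examples above can be checked with a short
Python sketch of the pairwise encoding (my own illustration, not code from
the thread). Decoding works greedily because the code is prefix-free: no
codeword is a prefix of another.

```python
# Pairwise prefix code: 00 -> "0", 01 -> "101", 10 -> "110", 11 -> "111".
CODE = {"00": "0", "01": "101", "10": "110", "11": "111"}

def encode(bits):
    """Encode a bit string of even length, two bits at a time."""
    return "".join(CODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(encoded):
    """Decode greedily; safe because the code is prefix-free."""
    inv = {codeword: pair for pair, codeword in CODE.items()}
    out, buf = [], ""
    for ch in encoded:
        buf += ch
        if buf in inv:
            out.append(inv[buf])
            buf = ""
    return "".join(out)

print(encode("00000011"))  # "000111"     -- 6 bits, shorter than 8
print(encode("11111100"))  # "1111111110" -- 10 bits, longer than 8
```

Running it reproduces both numbers from the post: the 00-heavy input shrinks
from 8 bits to 6, while the 11-heavy input grows from 8 bits to 10.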