On 10/24/17 6:30 PM, Steve D'Aprano wrote:
On Wed, 25 Oct 2017 07:09 am, Peter J. Holzer wrote:

On 2017-10-23 04:21, Steve D'Aprano <steve+pyt...@pearwood.info> wrote:
On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:

If the probability of certain codes (either single codes, or sequences of
codes) are non-equal, then you can take advantage of that by encoding the
common cases into a short representation, and the uncommon and rare cases
into a longer representation. As you say:


   Otherwise, if ( 0, 0 ) is much more frequent,
   we can encode ( 0, 0 ) by "0" and
   ( 0, 1 ) by "101",
   ( 1, 0 ) by "110", and
   ( 1, 1 ) by "111".

   And we could then use /less/ than two bits on the
   average.

That's incorrect. On average you use 2.5 bits.

(1*1 bit + 3*3 bits, divided by four possible outcomes, makes 2.5 bits.)

I disagree. If the distribution is not equal, then the average needs to
take the different probabilities into account.

I think I would call that the *weighted* average rather than the average.

Regardless of what we call it, of course both you and Stefan are right in how
to calculate it, and such a variable-length scheme can be used to compress
the data.
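
For concreteness, here's a quick Python sketch of both calculations. The code
lengths are Stefan's; the skewed probabilities (0.7 / 0.1 / 0.1 / 0.1) are just
numbers I picked for illustration, since no concrete distribution was given:

# Code lengths from Stefan's scheme, one entry per pair of bits.
code_lengths = {(0, 0): 1, (0, 1): 3, (1, 0): 3, (1, 1): 3}

# Unweighted average over the four possible pairs: 2.5 bits.
plain_average = sum(code_lengths.values()) / len(code_lengths)

# Weighted (expected) length under an assumed skewed distribution.
probabilities = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1}
weighted_average = sum(probabilities[pair] * code_lengths[pair]
                       for pair in code_lengths)

print(plain_average)     # 2.5
print(weighted_average)  # about 1.6 -- less than two bits per pair

With that (made-up) distribution the expected length is about 1.6 bits per
pair, which is the sense in which the scheme can beat two bits.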

E.g. given the sequence 00000011, which would take 8 bits in the obvious
encoding, we can encode it as "000111", which takes only 6 bits.

But the cost of this encoding scheme is that *some* bit sequences expand, e.g.
the 8 bit sequence 11111100 is encoded as "1111111110" which requires 10
bits.

The end result is that averaged over all possible bit sequences (of a certain
size), this encoding scheme requires MORE space than the obvious 0/1 bits.
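
To make that concrete, here is a small sketch (the names CODE and encode are
mine, not from the thread) that applies the pair code to whole bit strings and
then brute-forces the average over every possible 8-bit input:

from itertools import product

# Stefan's code table, applied to the bit string two bits at a time.
CODE = {'00': '0', '01': '101', '10': '110', '11': '111'}

def encode(bits):
    """Encode a bit string of even length using the variable-length code."""
    return ''.join(CODE[bits[i:i+2]] for i in range(0, len(bits), 2))

print(encode('00000011'))  # '000111'     -- 6 bits, shorter than 8
print(encode('11111100'))  # '1111111110' -- 10 bits, longer than 8

# Average encoded length over *all* 256 possible 8-bit inputs.
lengths = [len(encode(''.join(bits)))
           for bits in product('01', repeat=8)]
print(sum(lengths) / len(lengths))  # 10.0 -- more than 8 bits on average

The exhaustive average comes out to 10 bits for an 8-bit input, which is just
the unweighted 2.5 bits per pair times four pairs.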

But in practice we don't care much, because the data sequences we care about
are usually not "all possible bit sequences", but a heavily restricted subset
where there are lots of 00 pairs and fewer 01, 10, and 11 pairs.


My understanding of the 'Random Data Compressibility' challenge is that it
requires the compression to take ANY/ALL strings of up to N bits and generate
an output stream no longer than the input stream, and sometimes less. It
admits that given non-uniformly distributed data, it is possible to compress
some patterns (the common ones) and expand others (the uncommon ones) to lower
the net average length. What it says can't be done is to have a compression
method that compresses EVERY input pattern. That is where the 'Pigeon Hole'
principle comes into play, which the people who claim to be able to compress
random data like to ignore, or just assert doesn't apply.
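
For anyone who wants the counting argument spelled out, here's a tiny sketch
(N and the helper name are mine, chosen purely for illustration):

# Pigeonhole counting: inputs of exactly N bits vs. strictly shorter outputs.
def counting_argument(N):
    inputs = 2 ** N               # bit strings of exactly N bits
    shorter_outputs = 2 ** N - 1  # strings of 0 .. N-1 bits: 1 + 2 + ... + 2**(N-1)
    return inputs, shorter_outputs

for N in (1, 2, 8, 16):
    inputs, shorter = counting_argument(N)
    # There is always one more input than there are strictly shorter outputs,
    # so no lossless scheme can shrink every N-bit input.
    print(N, inputs, shorter)

There are 2**N inputs of exactly N bits but only 2**N - 1 strings that are
strictly shorter, so at least two inputs would have to share an output, and
the scheme couldn't be lossless.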
