On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:

> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>>When we have a source of 2 random binary digits,
>>there are 2^2 possible outcomes:
>>( 0, 0 ),
>>( 0, 1 ),
>>( 1, 0 ), and
>>( 1, 1 ).
>>. One cannot aggree upon a code to represent each
>>of those four possibilities with just one bit.
>
> If the probabilities of each outcome is 0.25.
When people talk about "random data", they are normally referring to
equal probabilities. Just as when they talk about compression in this
context, they mean lossless, reversible compression.

If the probabilities of certain codes (either single codes, or
sequences of codes) are unequal, then you can take advantage of that
by encoding the common cases with a short representation, and the
uncommon and rare cases with a longer representation. As you say:

> Otherwise, if ( 0, 0 ) is much more frequent,
> we can encode ( 0, 0 ) by "0" and
>
> ( 0, 1 ) by "101",
> ( 1, 0 ) by "110", and
> ( 1, 1 ) by "111".
>
> And we could then use /less/ than two bits on the
> average.

That's incorrect. On average you use 2.5 bits: (1*1 bit + 3*3 bits)
divided by four possible outcomes makes 2.5 bits.

Where you can actually get compression is if you limit yourself to
encoding specially selected input data in which the bit pairs are
non-random and (0,0) is far more common than the other combinations.
But that input is, by construction, highly non-random.

For example, if we are compressing a black and white bitmap, say a
fax, the most common data is a page which is mostly white with only a
bit of black text. That gives us a bitmap with lots and lots of (0,0)
pairs and relatively few of the other combinations, which means that
for the common case of "a page of black text on a white background"
the compression works well. (I'm treating 0 as white and 1 as black.)

But for purely random input, there will be roughly equal numbers of
each bit pair, and on average the encoding will expand the data by 25%
(two bits of input map to 2.5 bits of output, on average).

> But two bits are (maximally) random if
> the probability of every combination /is/ 0.25.

In the example you give, we have four possible pairs of bits:

Input:   (0,0)  (0,1)  (1,0)  (1,1)   # 2 bits on average
Output:   "0"   "101"  "110"  "111"   # 2.5 bits on average

So if the incoming data is random (uniformly distributed), this
encoding will expand the size of the data by 25%. Only if the data is
non-random and biased heavily towards (0,0) will you see any
compression savings.

Of course, in the real world most data we want to compress is
non-random, and we can get really impressive compression ratios. But
the cost of that is that you can rarely compress data twice (if you
do, you either get very little saving the second time, or the data
expands), and you cannot compress random (uniformly distributed) data.

> (If other combinations than ( 0, 0 ) were extremely
> rare, one could even improve the compression more.)

What you say is true, but it has nothing to do with Danceswithnumbers'
ignorant and ludicrous claim that he can compress random data and that
there is no mathematical proof that you cannot.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and
sure enough, things got worse.
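P.S. For anyone who wants to check the arithmetic, here is a quick
Python sketch. The code table is the one quoted above; the 85/5/5/5
split in the second case is just a made-up example of a heavily
biased source, not anything from Stefan's post.

    # Variable-length code from the quoted example.
    code = {(0, 0): "0", (0, 1): "101", (1, 0): "110", (1, 1): "111"}

    def expected_bits(probs):
        """Expected number of output bits per two-bit input pair."""
        return sum(p * len(code[pair]) for pair, p in probs.items())

    # Uniformly random pairs: every pair has probability 0.25.
    uniform = {pair: 0.25 for pair in code}
    print(expected_bits(uniform))   # 2.5 -> 25% expansion over 2 bits

    # A heavily biased source (made-up numbers for illustration).
    biased = {(0, 0): 0.85, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.05}
    print(expected_bits(biased))    # 1.3 -> genuine compression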