There's an entropy estimation equation that I occasionally see in
papers about compression, for example in:

"Binary-Valued Wavelet Decompositions of Binary Images"
http://www.eurasip.org/Proceedings/Eusipco/1996/paper/mdsp_3.pdf

The entropy estimation equation is this, found at the bottom of page 3:

    H(p) = -p*log2(p) - (1-p)*log2(1-p)              (1)

...where "p is the number of nonzero pixels in the image, divided by
the total number of pixels in the image" (the image being 1-bit black
and white).

This equation is effectively the same as the one I gave in the
earlier post, counting the occurrences of symbols, except it is
applied to only two symbols: 0 and 1. It estimates the probability of
a bit being set as 'p', which is the number of set bits divided by the
total number of bits. In other words, it's effectively a counter -
they just count the bits that are set, then normalize by dividing by
the total number of bits to get a value between 0 and 1, and call that
"probability".

If the probability of a bit being set is p, then the probability of a
bit being zero is 1-p, as there are only two symbols, and the total
probability is 1, so p + (1-p) = 1. So (1-p) in the equation refers to
the probability of a bit being zero.

Using Shannon's formula, the information content of an event with
probability "p" is -log2(p). So to estimate the average entropy of the
bits in the image, they add the contribution of the bits that are set
and the contribution of the bits that are unset: they multiply the
information content of ones, -log2(p), by the probability of a bit
being set (p) to get -p*log2(p), and the information content of zeros,
-log2(1-p), by the probability of a bit being zero (1-p) to get
-(1-p)*log2(1-p). If you sum these two, you get the above formula, and
they call it 'bits per pixel'.
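
To make this concrete, here is a rough Python sketch of the
bit-counting estimate (the function name and the plain list-of-bits
representation are just my choices for illustration):

    import math

    def bit_count_entropy(bits):
        # equation (1): p = fraction of set bits, H(p) = binary entropy
        n = len(bits)
        p = sum(bits) / n
        if p == 0.0 or p == 1.0:
            return 0.0            # all zeros or all ones: zero entropy
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

For example, bit_count_entropy([0, 1, 1, 0]) gives 1.0 bit per pixel,
and bit_count_entropy([0, 0, 0, 1]) gives about 0.81.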

I'm not saying this is necessarily correct, I'm just telling you how
this formula is derived, and how the probabilities of each symbol are
estimated. Apparently, when these folks implement image compression,
they try to minimize this formula. When graphed, this function looks
pretty much like an inverted parabola, peaking at p=0.5:

http://morpheus.spectralhead.com/img/entropy-eq.png

In addition, the contribution of ones is shown in red, and the
contribution of zeros in blue. In other words, this graph shows the
total entropy of a system with two symbols, where the probabilities of
the symbols are p and (1-p), respectively. When all bits are zero, or
all bits are one, the entropy is 0. So when doing image compression,
they try to make the bits as uniformly zero (or as uniformly one) as
possible. It also follows from the graph that an even 50/50
distribution of 1s and 0s would have maximum entropy.

Let me add a comment here. In the paper linked above, these folks are
trying to minimize the entropy of an image of a ship using a binary
wavelet transform, and using formula (1) they find that the 'entropy'
decreases from 0.995 bpp (bits per pixel) to 0.212 bpp.

My opinion is that they're probably wrong - if you look at the two
images, you'll see that the first one has very long continuous areas
that are either 1 or 0. So that image would be easily compressible
using run-length encoding. In other words, the above entropy formula
doesn't factor in the correlation between symbols (bits), namely that
there are continuous areas of the same color (either 1 or 0), so it
won't necessarily tell them how well the image would compress using
certain compression algorithms, like run-length encoding. In other
words, formula (1) doesn't translate well to real-world results.

If you remember from some time ago, I gave a formula that - instead
of counting the bits that are set and dividing by the total number of
bits, like equation (1) - counts the number of bit _flips_ and divides
it by the total number of possible bitflips. So in that case, "p"
would be:

           no. of bitflips
    p = -----------------------
         total no. of bits - 1

So an entropy estimation formula that counts bitflips instead of set
bits, analogous to equation (1), could be:

    H(p) = -p*log2(p) - (1-p)*log2(1-p)              (2)
        
where p is the number of bitflips divided by the total number of bits
minus one; in other words, it's the probability that two adjacent bits
differ, and (1-p) is the probability that two adjacent bits are equal.
So with this equation, when every bit differs from its neighbor
(pattern 010101010101...), it would also give an entropy of zero.
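
Again as a rough Python sketch, assuming the same list-of-bits
representation as before (the function name is mine):

    import math

    def bitflip_entropy(bits):
        # equation (2): p = fraction of adjacent bit pairs that differ
        flips = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
        p = flips / (len(bits) - 1)
        if p == 0.0 or p == 1.0:
            return 0.0            # constant signal, or 010101... pattern
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)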

Equation (1) estimates entropy by counting the bits that are set, and
equation (2) estimates entropy by counting bitflips. In my opinion,
(2) would better estimate how well an image can be compressed, because
it analyzes the correlation between adjacent symbols (bits) instead of
just counting the number of set bits, which is a crude estimate. For
example, if half of the bits are set, then equation (1) gives an
entropy of 1; yet when the first half of the image is all white and
the second half is all black, the image doesn't have much entropy, and
can be compressed with very high ratios using run-length encoding
(which is effectively an encoder with a window length of 2).
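
To illustrate with a toy example, using the two sketch functions above
on a hypothetical 16-pixel row:

    row = [1]*8 + [0]*8        # first half white, second half black
    bit_count_entropy(row)     # p = 0.5  -> 1.0, looks incompressible
    bitflip_entropy(row)       # p = 1/15 -> ~0.35, and it shrinks
                               # toward 0 as the image gets larger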

On the other hand, my equation would give a near-zero entropy
estimate for that image, as there is only a single bitflip in the
entire image, which would be a more accurate estimate of its entropy
and compressibility, as common image encoding algorithms (GIF, PNG
etc.) would optimize that out. Yet you'll see a simple bit-counter
metric in image processing papers. I hope you see now where my
'bitflip-counter' entropy estimate comes from - it factors in that
runs of equal adjacent bits can be compressed well by run-length
encoding, which is one of the simplest forms of data compression,
hence they will correlate with lower entropy. This estimate only sees
a 2-bit window, though, so it doesn't properly estimate
compressibility against encoders that employ a longer window - it's a
simple first-order approximation. Running tests to determine the
statistical correlation between this value and various image encodings
would be an interesting project.

Also, both graphs effectively look like an inverted parabola, so if
one only wants to minimize or maximize the number of set bits, a
function that gives a triangle peaking at 0.5 would be equally good,
for example this formula gives such a triangle:

         H(p) = 1 - abs(2*p - 1)
        
which has this graph, compared to eq. (1):

http://morpheus.spectralhead.com/img/triangle-eq.png
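
In Python terms, the triangle is simply (the name is my own):

    def triangle(p):
        # 0 at p = 0 and p = 1, peak of 1 at p = 0.5
        return 1 - abs(2*p - 1)

For example, triangle(0.5) gives 1 and triangle(0.1) gives 0.2.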

After all, often one just wants to see whether the value got closer
to zero or closer to one. Since both the parabola-like log-based
function and the triangle are monotonic on each side of p=0.5, both
indicate whether the "entropy estimate" increased or decreased, so
both are suitable for finding some minimum or maximum (without caring
what the actual value is), for example to fine-tune some data
compression algorithm. Often, the actual entropy doesn't matter, only
whether it increased or decreased.

If others are routinely counting bits to estimate entropy, I think I
am not crazy to count bitflips to estimate entropy. It is just another
way of looking at information, breaking the signal down into state
transitions instead of breaking it down into bits. If I am crazy,
then these researchers, PhDs, postgraduates and other university folks
writing those image processing papers are crazier than me, as they
just count the set bits to estimate entropy. That's not perfect, but
even that is sometimes 'good enough', for example to tell if a data
compression algorithm got better or worse (although their metric has
flaws). There are plenty of ways of estimating the amount of
information.

So, the "simplest" entropy estimate would be to count the number of
bits that are set, divide it by the total number of bits, call that
probability, and put that into equation (1). I'm not making this up,
you'll see this entropy estimate in image processing papers (and it's
effectively an estimator of the bias in the probability distribution).
If I'm not mistaken, for 1-bit waveforms it would estimate the bias,
giving maximum entropy for uniformly distributed samples, and minimum
entropy for a fully biased signal (either all ones or all zeros).
Which is no surprise, as the uniform distribution is the maximum
entropy probability distribution, and a constant signal has zero
entropy.

So if you ask, what does this have to do with *music* DSP
processing... well, music signals are typically represented by PCM
samples, and PCM samples consist of *bits*. Algorithms that you can
use in 2 dimensions, you can also use in 1 dimension. For example, it
would be interesting to test how the number of set bits (using
equation (1)) correlates with the compressibility of a PCM waveform
under various binary decompositions (for example, how well a lossless
wave compressor can compress it), which is a potential area of
research. My half-educated guess would be: the more bits are zero,
the better the signal will compress, as more zero bits will mean that
the sample values are - on average - lower, which will produce a
shorter code when compressed. Hence, less entropy. If you do not see
why this is interesting for the purpose of music or audio processing,
then you may lack imagination. My only problem is finding the time to
try out all these things (and finding the time to write about them).
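
As a sketch of what such a test could look like, assuming 16-bit
signed PCM samples in a plain Python list and reusing the
bit_count_entropy() sketch from above (all names are mine):

    def bitplane_entropies(samples, bits=16):
        # take the two's-complement bit pattern of each sample and
        # apply equation (1) to each bitplane separately, LSB first
        mask = (1 << bits) - 1
        planes = [[((s & mask) >> k) & 1 for s in samples]
                  for k in range(bits)]
        return [bit_count_entropy(plane) for plane in planes]

How these per-bitplane estimates relate to what an actual lossless
wave compressor achieves is exactly the open question.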

Working on 1-bit samples over GF(2) makes things immensely simple, as
you only have two symbols: 0 and 1. And once you have an algorithm
that works on 1-bit samples, you can use it on any bit depth, as any
waveform can be trivially broken down into 1-bit square waves. If you
have an algorithm that can compress 1-bit black and white images, then
you can compress any 24-bit RGB image by simply breaking it down into
R, G, B color planes, and breaking each color plane down into 8
individual bitplanes. The same goes for audio - it's just numbers, and
it's often irrelevant whether they represent pixel colors or audio
samples; both are effectively some 'sampled value'. So you can think
of bits as a generalized form of information, whether they represent
pixels, samples, symbols, or whatever else. I'll probably show later
how you can compress any PCM waveform by breaking it down into
individual bitplanes. (And since in both cases it's just numbers, I
can use the same algorithm to compress images as well.)
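
For the decomposition itself, a minimal sketch for unsigned values of
any bit depth (names are mine):

    def to_bitplanes(values, bits=8):
        # one 0/1 plane per bit position, LSB first
        return [[(v >> k) & 1 for v in values] for k in range(bits)]

    def from_bitplanes(planes):
        # reassemble the original values from the planes
        return [sum(bit << k for k, bit in enumerate(sample_bits))
                for sample_bits in zip(*planes)]

The same two functions work whether the values are R, G, B components
of pixels or (offset) PCM samples - as said above, it's just numbers.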

Best regards,
Peter Schoffhauzer