Re: [music-dsp] about entropy encoding

2015-07-17 Thread Peter S
On 18/07/2015, Peter S  wrote:
> In the tests that gave good results for *some* test material, I
> simply grouped adjacent bits. (*)

(*) ...and of course, building histograms from characters or samples
also means 'grouping adjacent bits', since a character means "8
adjacent bits", a sample means "N adjacent bits", etc. When the source
material is white noise, you can group the bits in any fashion
(bits, octets = bytes, multi-byte samples, etc.) and it will always give
a uniform probability distribution (and hence an entropy estimate of
~1 bit per bit), regardless of how you group the bits. For other
(more correlated) signals, this won't work as well.
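
For illustration, here is a minimal Python sketch of this kind of
regrouping (illustrative only, not code from the original posts; the
helper names entropy_bits and group_bits are made up). For white noise
the per-bit estimate stays near 1 for any grouping width, because every
k-bit pattern is roughly equally likely:

import math
import random
from collections import Counter

def entropy_bits(symbols):
    """First-order (histogram) entropy of a symbol sequence, in bits/symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def group_bits(bits, width):
    """Group adjacent bits into non-overlapping width-bit symbols."""
    return [tuple(bits[i:i + width]) for i in range(0, len(bits) - width + 1, width)]

# White noise: the per-bit estimate stays near 1 for any grouping width.
bits = [random.getrandbits(1) for _ in range(1 << 16)]
for width in (1, 2, 4, 8):
    print(width, round(entropy_bits(group_bits(bits, width)) / width, 3))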


Re: [music-dsp] about entropy encoding

2015-07-17 Thread Peter S
On 18/07/2015, robert bristow-johnson  wrote:
>
> listen, one thing i have to remind myself here (because if i don't, i'm
> gonna get embarrassed) is to not underestimate either the level of
> scholarship or the level of practical experience doing things related
> to music (or at least audio) and DSP.

Failure to understand the difference between the concepts of 'entropy'
and 'entropy rate' indicates something lacking in the scholarship
level. Failure to understand that log2(p) != 0 when p != 1 may also
indicate something lacking in the scholarship level (or some
conceptual misunderstanding). The notion of 'entropy' is already
well-defined in the literature, so it is not my task to define it; it
is better to point to the references in the literature.
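
As a small numeric illustration of the log2(p) point (not from the
original posts): the binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p)
is nonzero for every p strictly between 0 and 1, while the entropy
*rate* of a correlated binary source can sit far below its first-order
entropy. For example, a symmetric two-state Markov chain that flips
with probability q has a 50/50 stationary distribution (first-order
entropy of 1 bit) but an entropy rate of only H(q):

import math

def binary_entropy(p):
    """H(p) in bits; zero only at p = 0 or p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))    # 1.0 bit
print(binary_entropy(0.99))   # ~0.081 bits: small, but not zero

# Symmetric two-state Markov chain that flips state with probability q:
# first-order entropy is 1 bit/symbol, entropy rate is only H(q).
q = 0.01
print(binary_entropy(q))      # ~0.081 bits/symbol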

> listen, i understand LPC.

Great! I never said or thought or implied that you don't. I merely
noticed and expressed that a bitflip counter is equivalent to an error
estimator for a subset of linear predictors (namely, binary delta
coding), without implying anything about your knowledge or
understanding. (And for that matter, I did not address that particular
message to you personally, I just thought this particular topic could
be interesting for *some* readers.)

> i know how to do this stuff.  i know how to teach this stuff and in a
> previous lifetime have done it.

Great. When did I say or imply you don't? LPC is actually a fairly
basic concept.

> so be careful, lest you lose people's interest, to restrain any
> patronizing tone.  we're not as dumb as you think.

When did I say you're dumb? I merely pointed out to Ethan Duni that
"entropy" != "entropy rate". Failure to understand this results in
failure to understand that, except in some rare corner cases, virtually
all signals have entropy. That doesn't imply that he is dumb (I don't
think he is dumb); it just implies that he doesn't understand the
difference between entropy and entropy rate. Hence, I pointed him to a
standard piece of literature that defines it, specifically because he
asked me to define it.

If you guys are eager to be patronizing and to point out the
supposedly missing areas of my knowledge, then I'll also be eager to
point out where your knowledge may be lacking. That's a fair deal,
isn't it? After you, Robert, told me I am a crackpot by giving me a
link to the "Crackpot index" (I still remember - I don't forget), I think
I was not even harsh (and in light of that, you sound hypocritical).

Now let's get back to the topic.

> so here's my question(s) ... please read carefully because i'm the guy
> i can sorta see a generic measure of entropy doing that, but it assumes
> messages that are adjacent bits.  you can add to that messages that are
> grouped together in any imagined fashion, not just adjacent bits.

Exactly. That would give another correlation measure.

> there are virtually uncountable ways of grouping together the bits of a
> binary file to measure entropy.  i simply cannot believe you did it
> every-which-way.

Exactly. I did not do it in every possible way, though that would be
interesting to test (in a given window). For a finite number of bits,
the number of combinations is finite, though for high N it will
certainly be impractical to compute. I would expect that the more
correlations you analyze, the more accurate a measure you'll get.

> so, because i am too lazy to look around at previous posts, what is the
> *specific* way or ways you grouped together the bits into messages so
> that you could count their frequency of occurrence and get a measure of
> mean bits of information per message?

First, let me point out that there's no universally good answer to
this question; in general, it depends on the message and the
source material (and the length of the analysis window, amount of
input, etc.). In my tests, I got fairly good results on *certain* test
data by using a sliding window, essentially what you described
(grouping adjacent bits and building a histogram). If I understand
right, the frequency of m1m2m3...mN, relative to the frequency of the
prefix m1m2m3...mN-1, gives the transition probability of mN occurring
after m1m2m3...mN-1. I got this idea from Shannon's paper, where he
does the same for letters to analyze transition probabilities of
English text. If the bits/symbols are grouped in a different fashion,
then the occurrence of that pattern would indicate the relative
probabilities of bits/symbols being in that particular position
relative to each other (I haven't tested this).
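
One possible reading of that sliding-window idea as a sketch
(illustrative only, not code from the original posts; ngram_counts and
transition_prob are made-up names): the count of an N-bit pattern,
divided by the count of its (N-1)-bit prefix, estimates the transition
probability of the last bit given the preceding bits.

from collections import Counter

def ngram_counts(bits, n):
    """Count every adjacent n-bit pattern with a sliding window (stride 1)."""
    return Counter(tuple(bits[i:i + n]) for i in range(len(bits) - n + 1))

def transition_prob(bits, context, nxt):
    """Estimate P(nxt | context) as count(context + nxt) / count(context)."""
    ctx = tuple(context)
    denom = ngram_counts(bits, len(ctx))[ctx]
    if denom == 0:
        return None  # context never observed
    return ngram_counts(bits, len(ctx) + 1)[ctx + (nxt,)] / denom

bits = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
print(transition_prob(bits, (0, 0), 1))  # 1.0 for this periodic pattern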

In the tests that I'm currently doing, I find that what worked well
for simple 1-bit signals doesn't work well for multi-bit (audio)
signals, indicating that a different measure is needed. As said, there
are a zillion ways of analyzing possible correlations, so I cannot
give you a universally good measure. In fact, that's what I'd like to
find out myself as well. So this simple measure has very limited
application, and I find that it doesn't work well for more complex
signals. I have some furt

Re: [music-dsp] about entropy encoding

2015-07-17 Thread robert bristow-johnson

On 7/17/15 2:28 AM, Peter S wrote:

Dear Ethan,

You suggested me to be short and concise.
My kind recommendation to you:

1) Read "A Mathematical Theory of Communication".
2) Try to understand Theorem 2.
3) Try to see, when p_i != 1, then H != 0.

I hope this exercise will help you grasp this topic.


listen, one thing i have to remind myself here (because if i don't, i'm 
gonna get embarrassed) is to not underestimate either the level of 
scholarship or the level of practical experience doing things related 
to music (or at least audio) and DSP.  just a friendly caution to 
consider the same.



On 7/17/15 3:27 PM, Peter S wrote:

A linear predictor[1] tries to "predict" the next sample as the linear
combination of previous samples as

 x'[n] = SUM [i=1..k] a_i * x[n-i]

where x'[n] is the predicted sample, and a_1, a_2 ... a_k are the
prediction coefficients (weights). This is often called linear
predictive coding (LPC),


listen, i understand LPC.  i also understand (i remember having multiple 
email exchanges with Claude Cellier about the topic in the '90s) how to 
use LPC to reduce the word size (if the LPC is effective, the LPC guess 
for the following sample is, statistically-speaking, usually close to 
the actual sample, so the word size is smaller and the probability on 
each of those reduced-width words is higher) followed by Huffman coding 
on those often used delta values.  if you have a non-flat spectrum, LPC 
will make it flat, and if you have a non-uniform p.d.f., Huffman coding 
will similarly take advantage of that.  you can combine the two and 
tweak it to the point that the losslessly compressed file can't get much 
more compressed without adding more to the model (like periodicity, so 
that the LPC doesn't just use the most current N samples to predict the 
future sample, but also uses a bunch of samples that are P samples 
prior, where P is the best-fit period for a quasi-periodic input).  i 
know how to do that (and have done MATLAB implementations, but never 
anything that saw the commercial light-of-day).
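
A toy sketch of that "flatten with a predictor, then entropy-code the
residuals" pipeline (illustrative only, not code from any of the
implementations mentioned; the (2, -1) coefficients are a second-order
fixed polynomial predictor of the kind used in Shorten/FLAC fixed
modes): the residual's first-order entropy drops well below that of the
raw samples, which is what a Huffman or Rice stage then exploits.

import math
from collections import Counter

def entropy_bits(values):
    """First-order entropy in bits/symbol of a value sequence."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def fixed_predictor_residual(x, coeffs=(2, -1)):
    """Residual of a fixed linear predictor: with coeffs (2, -1) it
    predicts x'[n] = 2*x[n-1] - x[n-2]."""
    k = len(coeffs)
    res = list(x[:k])  # warm-up samples passed through verbatim
    for n in range(k, len(x)):
        pred = sum(a * x[n - 1 - i] for i, a in enumerate(coeffs))
        res.append(x[n] - pred)
    return res

# Slowly varying "audio-like" test signal: the residuals cluster near zero,
# so their first-order entropy (and hence the Huffman code length) drops.
x = [round(100 * math.sin(2 * math.pi * n / 200)) for n in range(4000)]
print(entropy_bits(x), entropy_bits(fixed_predictor_residual(x)))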


i also know how i might do it for music that is the combination of a few 
sources, if i get good pitch-detection data from a source separation 
alg, which is another big, difficult problem.


i know how to do this stuff.  i know how to teach this stuff and in a 
previous lifetime have done it.  (i like to use a recurring character of 
George Carlin: Al Sleet, the Hippy-Dippy Weatherman, to illustrate 
Shannon's use of message probability, or frequency-of-occurrence, as a 
natural measure of the true information content a message has.)


so be careful, lest you lose people's interest, to restrain any 
patronizing tone.  we're not as dumb as you think.


On 7/17/15 1:38 PM, Peter S wrote:

On 17/07/2015, robert bristow-johnson  wrote:

On 7/17/15 1:26 AM, Peter S wrote:

On 17/07/2015, robert bristow-johnson   wrote:

in your model, is one sample (from the DSP semantic) the same as a
"message" (from the Information Theory semantic)?

A "message" can be anything - it can be a sample, a bit, a combination
of samples or bits, a set of parameters representing a square wave,
whatever.

doesn't answer my question.

It does, it just depends on the model.

In the 1-bit square wave duty cycle estimator, a message is a "bit".
In the English text compression experiment, a message is "a character".
In the white noise entropy estimation experiment, a message is a "byte".
In the binary waveform entropy experiment, a message is "a string of bits".
In the bitflip counter, a message is the event that "two consecutive bits differ".
In the parametric square wave thought experiment, a message is "a set of
parameters describing a square wave".

Whatever arbitrarily chosen thing I send to you over a channel,
becomes a "message". There's no universal definition of what a
"message" is, depending on the particular model, it can be literally
anything.


so here's my question(s) ... please read carefully because i'm the guy 
who gets to decide if my question really got answered, and i am still 
pretty much clueless about it:


what is the atom of information that you are counting to create a histogram?

maybe it's sophisticated: first you count every 1-bit message and create 
a two-column histogram.


then rewind the soundfile and count every 2-bit message, grouped 
adjacently and add that to the histogram.


then rewind the soundfile again and count each 3-bit message, and add 
that to the histogram.


continue to some defined word-width limit.
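
Read literally, that multi-width procedure might look something like
the following sketch (a guess at what is being described, not anything
confirmed in the thread; multi_width_histogram is a made-up name):

from collections import Counter

def multi_width_histogram(bits, max_width):
    """Count every adjacent 1-bit, 2-bit, ..., max_width-bit group into a
    single combined histogram, rewinding over the same data per width."""
    hist = Counter()
    for width in range(1, max_width + 1):
        for i in range(0, len(bits) - width + 1, width):
            hist[tuple(bits[i:i + width])] += 1
    return hist

print(multi_width_histogram([0, 1, 1, 0, 1, 1, 0, 1], 3))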

i can sorta see a generic measure of entropy doing that, but it assumes 
messages that are adjacent bits.  you can add to that messages that are 
grouped together in any imagined fashion, not just adjacent bits.


there are virtually uncountable ways of grouping together the bits of a 
binary file to measure entropy.  i simply cannot believe you did it 
every-which-way.


so, because i am too lazy to look around at previous posts, what is the 
*specific* way or

Re: [music-dsp] about entropy encoding

2015-07-17 Thread Peter S
I tested a simple, first-order histogram-based entropy estimate idea
on various 8-bit signed waveforms (message=sample, no correlations
analyzed). Only trivial (non-bandlimited) waveforms were analyzed.

Method:
1) Signal is trivially turned into a histogram.
2) Probabilities assumed based on histogram frequencies.
3) Entropy rate of the probability distribution is calculated.
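
A rough sketch of how steps 1-3 might look (illustrative only, not the
original test code; the normalization by the 8-bit/sample maximum is an
assumption, but it is consistent with the reported numbers below, e.g.
~1 for full-scale white noise, 0.125 = 1 bit / 8 bits for a two-valued
square wave, and 0 for a constant signal):

import math
import random
from collections import Counter

def histogram_entropy_estimate(samples, bits_per_sample=8):
    """1) build a histogram, 2) treat frequencies as probabilities,
    3) compute the entropy of that distribution -- here scaled by the
    maximum bits/sample, which is an assumption."""
    counts = Counter(samples)
    total = len(samples)
    h = sum(c / total * math.log2(total / c) for c in counts.values())
    return h / bits_per_sample

noise = [random.randrange(-128, 128) for _ in range(1 << 16)]
square = [100 if (n // 64) % 2 else -100 for n in range(1 << 16)]
print(histogram_entropy_estimate(noise))       # ~1.0
print(histogram_entropy_estimate(square))      # 0.125
print(histogram_entropy_estimate([0] * 1000))  # 0.0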

Detailed results:
-
http://morpheus.spectralhead.com/entropy/histogram/

Observations:
-
White noise:
- as window length increases, estimate converges to 1.
- as amplitude decreases, estimate converges to ~0.2.

Square wave:
- estimate is about 0.125, regardless of frequency or amplitude.

Saw wave:
- as frequency increases, converges to ~0.2.
- as frequency decreases, converges to 1.

Sine wave:
- as frequency increases, converges to ~0.17.
- as frequency decreases, converges to ~0.95.
- as amplitude decreases, converges to ~0.2

Constant signal:
- estimate is always zero.

Conclusions:

As expected, being first-order, it is overall a poor estimator. For some
corner cases (white noise, constant signal), it gives the correct result.
For other, periodic signals, it gives an estimate with varying amounts of
error. Since it doesn't analyze correlations, it cannot distinguish a
low-frequency saw wave from white noise. It gives a rather poor estimate
for most low-frequency periodic waveforms, and does somewhat better at
high frequencies.

-P


Re: [music-dsp] about entropy encoding

2015-07-17 Thread Peter S
A linear predictor[1] tries to "predict" the next sample as the linear
combination of previous samples as

x'[n] = SUM [i=1..k] a_i * x[n-i]

where x'[n] is the predicted sample, and a_1, a_2 ... a_k are the
prediction coefficients (weights). This is often called linear
predictive coding (LPC), and one typical application is to encode the
formants in telephony using the source-filter model of
speech[2]. Another typical application is waveform encoding in
lossless wave compression codecs[3] like Shorten, FLAC, WavPack,
MPEG-4 ALS, etc.

The simplest possible form of a linear predictor is when a_1 = 1 and
all other coefficients a_2, a_3, ..., a_k = 0. So it simply predicts
the next sample as

x'[n] = x[n-1]

In other words, it predicts the next sample to be the same as the previous one.

This special case of linear prediction is called "delta encoding"[4]:
the next sample is predicted to be the same as the previous one, and
only the difference between two consecutive samples is encoded. One
compressed audio file format that uses this type of encoding is the
Amiga IFF 8SVX file format[5].
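
A minimal sketch of plain delta coding as described (illustrative only;
details specific to 8SVX's own delta scheme are not modeled here):

def delta_encode(samples):
    """Keep the first sample, then store x[n] - x[n-1], i.e. the residual
    of the a_1 = 1 predictor x'[n] = x[n-1]."""
    return samples[:1] + [samples[n] - samples[n - 1] for n in range(1, len(samples))]

def delta_decode(deltas):
    """Inverse: a running sum reconstructs the original samples exactly."""
    out = deltas[:1]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

x = [10, 12, 13, 13, 11, 8]
assert delta_decode(delta_encode(x)) == x
print(delta_encode(x))  # [10, 2, 1, 0, -2, -3]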

If we simplify this delta encoding further and apply it to the
smallest finite field GF(2)[6], then we get:

b'[n] = b[n-1]

where b[n] is the nth bit, and b'[n] is the prediction for the nth
bit. In this model, if we try to estimate the prediction error, then -
since GF(2) only has two elements, 0 and 1 - the difference (delta)
between two consecutive bits can be either 0 or 1.

Since the addition operation in GF(2) is XOR, the difference (delta)
between two consecutive bits equals A XOR B, which gives 1 for a 0->1 or
1->0 transition, and gives zero for two identical bits. To calculate
the total prediction error, we simply need to sum the total number of
binary transitions in the binary string:

E = SUM [i=2..N] ( b[i] XOR b[i-1] )

where N is the total number of bits, and E is the total error between
the prediction and the actual samples.

So an algorithm that counts bitflips in a binary string can be
thought of as summing the errors of the simplest possible linear
predictor over GF(2), with a_1 = 1 and a_2, a_3, ..., a_k = 0, which
simply "predicts" that the next bit will be the same as the previous
bit. By counting the binary transitions, we get the sum of 'deltas',
or the sum of the 'prediction errors' of binary delta coding.
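
As a sketch (not from the original post), that error sum is just a
transition count:

def bitflip_count(bits):
    """E = sum of b[i] XOR b[i-1]: the number of 0->1 / 1->0 transitions,
    i.e. the summed prediction error of the GF(2) predictor b'[n] = b[n-1]."""
    return sum(bits[i] ^ bits[i - 1] for i in range(1, len(bits)))

print(bitflip_count([0, 0, 1, 1, 1, 0, 1]))  # 3 transitions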

If we think of entropy as 'unpredictability' and we expect it to
(possibly nonlinearly) correlate with the sum of errors between a
prediction and the actual samples, then the summed error of this
simplistic linear predictor can be thought of as a very crude
approximation to entropy.

Of course, this is an extremely simplistic and dumb predictor - it is
the simplest possible linear predictor that can possibly exist, as
GF(2) is the smallest finite field, and it only has a single
coefficient a_1 = 1. It cannot account for periodicity, bits further
back than the previous one, statistical probabilities, or other
correlations. It can be thought of as an exercise in trying to create
the "simplest possible predictor".

Despite this, and however simple this predictor may seem, by extending
this idea further I could successfully create a lossless PCM wave
compression algorithm with compression ratios comparable to that of
codecs using more complex linear predictors, like FLAC or Shorten.

(More on this later.)

Best regards,
Peter

References:
[1] https://en.wikipedia.org/wiki/Linear_prediction
[2] 
https://en.wikipedia.org/wiki/Source%E2%80%93filter_model_of_speech_production
[3] https://en.wikipedia.org/wiki/Linear_predictive_coding#Applications
[4] https://en.wikipedia.org/wiki/Delta_encoding
[5] https://en.wikipedia.org/wiki/8SVX
[6] https://en.wikipedia.org/wiki/GF%282%29


Re: [music-dsp] about entropy encoding

2015-07-17 Thread Peter S
On 17/07/2015, robert bristow-johnson  wrote:
> On 7/17/15 1:26 AM, Peter S wrote:
>> On 17/07/2015, robert bristow-johnson  wrote:
>>> in your model, is one sample (from the DSP semantic) the same as a
>>> "message" (from the Information Theory semantic)?
>> A "message" can be anything - it can be a sample, a bit, a combination
>> of samples or bits, a set of parameters representing a square wave,
>> whatever.
>
> doesn't answer my question.

It does, it just depends on the model.

In the 1-bit square wave duty cycle estimator, a message is a "bit".
In the English text compression experiment, a message is "a character".
In the white noise entropy estimation experiment, a message is a "byte".
In the binary waveform entropy experiment, a message is "a string of bits".
In the bitflip counter, a message is the event that "two consecutive bits differ".
In the parametric square wave thought experiment, a message is "a set of
parameters describing a square wave".

Whatever arbitrarily chosen thing I send to you over a channel,
becomes a "message". There's no universal definition of what a
"message" is, depending on the particular model, it can be literally
anything.


Re: [music-dsp] about entropy encoding

2015-07-17 Thread robert bristow-johnson

On 7/17/15 1:26 AM, Peter S wrote:

On 17/07/2015, robert bristow-johnson  wrote:

in your model, is one sample (from the DSP semantic) the same as a
"message" (from the Information Theory semantic)?

A "message" can be anything - it can be a sample, a bit, a combination
of samples or bits, a set of parameters representing a square wave,
whatever.


doesn't answer my question.


is your measure the same as

  SUM{ prob(message) * -log2(prob(message)) }
 all messages

?

Yes. In this particular example, a "message" means "a set of
parameters that describe a parametric square wave". To distinguish
between two or more messages, at least 1 bit is needed.


and this question doesn't get answered until the previous one does.

--

r b-j  r...@audioimagination.com

"Imagination is more important than knowledge."


