A few months ago, someone suggested that I go home and read some
beginner books. Actually, nearly a decade ago, I co-authored a book on
modular audio processing titled "Visual VSTi Programming", teaching
people how to implement all the classical DSP algorithms. And before
you say that modular audio is only for beginners and lamers, let me
say that I've hand-written more than 200 DSP modules in C++, and that
book represents my knowledge 10 years ago. By education, I'm an IT
teacher and linguist.

Which kind of knowledge do you think is worth the most?

1) having read a book about it
2) having implemented it in code
3) having implemented it in code, and also taught others how to do it

Over the years, I invented and implemented several new DSP algorithms,
ones that you won't find in any book, anywhere. To show you the work I
have done over the years, I uploaded some audio demos of the musical
filters, effects and instruments that I invented. I think that is all
100% on-topic here; these are strictly *musical* filters. Before you
dismiss my theories, listen to my DSP work:

"DSP Algorithms of the Future"
http://morpheus.spectralhead.com/demos/

For comparison, I also included audio examples of the cookbook biquad
filters of Robert Bristow-Johnson, who argued that my theories are
wrong, and that I should do things by the book. I'm not telling you
which sounds better. You be the judge - which sounds better? The
"classic", old-school algorithms that you find in textbooks and old
papers, or the ones that came out of my head? After listening to my
demos, if you wanted to learn digital filters and synthesis, who would
you ask? Robert, or me? (You can also compare them to your own audio
algorithms, if you have any).

Personally, I prefer to invent the future that doesn't yet exist in
books, instead of reiterating the past for the 1000th time. Most of
these filter algorithms you won't find in any book, anywhere, as
they're entirely my own inventions, not following any 'traditional'
designs.
I'm a pragmatic person - I do what "works" in any given situation,
even if it contradicts someone's theory. Shannon's work is titled "A
Mathematical *Theory* of Communication" - which is, as the title says,
just a "theory" (although a very inspiring one). It's not a Bible or a
"One Universal Truth". As soon as you start applying "Shannon entropy"
to certain real-world scenario outside some 1950s telecommunications
context, it doesn't fully make sense.

Real-world example:

Imagine that I download the Titanic movie from the internet (during
which it already became a 'message', more precisely, one fragmented
over several TCP/IP packets). [And if you prefer to stay strictly
on-topic,
you could substitute the Titanic with your favourite musical album,
say, encoded as PCM or MP3.]

Now imagine that I burn that movie to a DVD, and send it to you in a
message (say, in the mail). So I've sent the string of bits that
represent the Titanic movie to you over a non-noisy physical carrier
on an optical medium. (More precisely, the optical medium is "noisy"
from the dust and scratches, but it has its own built-in error
correction.)

Question: what is the "entropy" of my message?

Now, this question doesn't make much sense. First of all, how do you
even define that at all in this context? Second, what would even be
the *purpose* of assigning some real-valued number between 0 and 1 to
a movie? What? You say the Titanic movie cannot be a message? Who said
that? Seriously, did Shannon say that a message *cannot* be 4.5
gigabytes long? And if it *can* be 4.5 GB long (why not?), then why
couldn't it be the Titanic movie, encoded as MPEG-4 AVC with x264? Why
would one make such an arbitrary restriction? (Remember: *all* digital
messages consist of "bits", so any "digital message" is effectively a
"string of bits".)

[Food for thought: the usage of the 0-1 range to express a measure of
information is an entirely arbitrarily chosen metric, which is
followed merely by convention. In practice, it could be any range.
Someone, somewhere in history, once said: "Hmm... I like the number
'1'.... Let's use that as the upper bound for the amount of
information!" And ever since, it's been followed by tradition. Now,
when I do practical computing
on fixed point numbers, I may find it more practical to use a fixed
point integer number instead, as the samples or pixels I am processing
are often in fixed point anyways. So why do int->float conversion in
the first place, when it's not necessary? Normalization is costly,
with division being the *most expensive* basic operation, costing
roughly 30x as much as a single addition (which may make normalization
extremely costly if you want to do it a billion times); it is often
redundant, and may actually decrease precision or introduce
quantization errors. This should be trivial to anyone intimately
familiar with the floating point representation of numbers.

Since IIRC the originally suggested problem was that "you cannot
compute so many correlations" - in response to that, I gave the
*simplest*, dumbest possible decorrelation estimator that is the
easiest (=cheapest) to compute. That's what you asked for... Of
course, you can make it fancy, normalize it to one, etc. but that also
adds to the cost. You can also imply normalization during computation,
without actually doing it. So I skip normalization when it's not
needed, hence I don't always use 0-1 range to express something.
That's an arbitrarily chosen range anyways - consider this: how do you
even express 0-1 on an embedded microprocessor that has no floating
point operations or registers? In *practice*, even standard IEEE 754
floats are just approximations. There are no *precise* floating point
numbers in the digital world, everything is just "bits" and
"approximations" (unless you do symbolic processing or arbitrary
precision arithmetic, which is a whole different story). Again, in the
"real world", everything is just "bits", whatever symbolic information
they represent on a semantic level. Why are you missing the entire
physical layer in the OSI model?]
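
To make the "simplest, dumbest possible decorrelation estimator"
concrete, here is a minimal sketch of the *kind* of cheap,
unnormalized, integer-only estimate I mean - not necessarily the exact
formula I posted earlier, just an illustration: count how many
adjacent bits differ in a buffer, using XOR and popcount. A constant
or slowly-varying (highly correlated) bit stream gives a low count, a
noise-like stream gives a count near half of the compared bit pairs,
and no division or floating point is involved anywhere:

    // Illustrative sketch: unnormalized "bit decorrelation" estimate.
    // Counts bit transitions inside each byte (neighbouring bits that differ).
    // Low count = redundant/correlated bits, high count = noise-like bits.
    #include <cstdint>
    #include <cstddef>

    uint64_t bit_decorrelation(const uint8_t* data, size_t len)
    {
        uint64_t transitions = 0;
        for (size_t i = 0; i < len; ++i)
        {
            // XOR the byte with itself shifted right by one bit: set bits mark
            // positions where two neighbouring bits differ (7 pairs per byte).
            uint8_t x = (uint8_t)(data[i] ^ (data[i] >> 1));
            transitions += __builtin_popcount(x);   // GCC/Clang intrinsic
        }
        return transitions;   // deliberately left as a raw, unnormalized count
    }

If a 0-1 figure is ever needed, a single division by the number of
compared bit pairs (7 * len here) at the very end is enough; there is
no need to normalize inside the loop, in the spirit of implying
normalization without actually performing it.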

Now, for the sake of a thought experiment, let's imagine that we could
calculate the "Shannon-entropy" of the Titanic movie precisely when I
send it to you in a message:

"Well, at first I was unfamiliar with the movie when I watched it, so
it had an entropy of 1.0. Later I watched it again, but since I
remembered most of it already, it had an entropy of 0.2371625. A half
year later I watched it again, by which time I forgot about half of
it, so it had an entropy of 0.71264."

... to me, this doesn't make much sense. Assigning some real-valued
number between 0 and 1 to the Titanic movie to determine its
"information content" for a certain recipient doesn't sound any less
nonsensical to me than it sounds to you when I assign some real-valued
number to an arbitrary string of bits based on decorrelation analysis
to estimate its information content for the purpose of various kinds
of processing. (Well, you could define the entropy based on the
probability of "accidentally" sending you the bits that represent the
Titanic movie encoded as MPEG-4 AVC with x264, if I started sending
you bits randomly. Which, for all practical purposes, has a
probability of zero.)
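
(For what it's worth, if you did take that route: under a uniformly
random bit source, the probability of producing one *specific* N-bit
string is 2^-N, so its self-information is -log2(2^-N) = N bits. For a
4.5 GB file that is roughly 4.5e9 * 8 = 3.6e10 bits - in other words,
the number just restates the length of the file, which rather
illustrates how little it tells you in this context.)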

To me, it is a lot more interesting to test things like, "if I remove
a certain pixel from the movie, will your brain still perceive the
whole as the *same* information? Will your brain notice that "missing"
pixel?" If not, then - in my view - that individual pixel has 'low
entropy', compared to those pixels which are not a good idea to take
away, because your brain *will* notice their absence. If your mind
doesn't "perceive" something and you can freely take it away without
loss of perceived content for the recipient, is it "information"? How
many pixels can I take away while the recipient still "recognizes" the
information? Where do the "bits" end, and where does the "content"
start? Considering these questions is already useful for things like
video compression, and we might call this a "perception-based approach
to information". (As far as I know, Shannon was also concerned with
data compression, so this is relevant to his work.)
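
As a crude illustration of that "will your brain notice it" test (this
is only a sketch of the idea, not a model of human vision): score each
pixel by how much it differs from its immediate neighbours. Pixels in
flat regions score near zero and can be dropped and re-interpolated
with little perceived loss; pixels on edges and detail score high and
carry most of the perceived content.

    // Crude "perceptual significance" score for one grayscale pixel
    // (illustrative sketch only): distance from the average of its 4
    // neighbours. ~0 in flat areas (cheap to drop), large at edges/detail.
    #include <cstdlib>

    int pixel_significance(const unsigned char* img, int w, int h, int x, int y)
    {
        if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1)
            return 255;                               // keep image borders
        int neighbours = img[(y - 1) * w + x] + img[(y + 1) * w + x]
                       + img[y * w + (x - 1)] + img[y * w + (x + 1)];
        return std::abs(4 * img[y * w + x] - neighbours) / 4;
    }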

Once you start applying "Shannon entropy" in the strict sense to
certain real-world contexts, it kinda stops making sense. It's not
that it's "impossible" to determine; rather, it makes no sense in that
context. For example, if your goal is to determine how much
information is "new" in a message for a certain recipient, you could
simply do some social engineering, hand out a questionnaire to the
recipient of the message, and simply ask for that information. If you
give enough incentive, and use the proper questions, the recipient
will simply tell you. After all, the notion that the recipient is
already 'familiar' with some information assumes that the recipient
has 'intelligence', and so can answer questions. If the recipient has
no intelligence, and cannot answer, then why are you even sending a

So if it follows from Shannon's theory that the recipient has
intelligence, why not simply ask what you want to find out? Did
Shannon say that you *cannot* ask the recipient of the message a
question? Why make such an arbitrary restriction that "asking
questions is forbidden"? If it's not forbidden, then why not simply
ask the recipient: "On a scale of 0 to 1, how much of this info is new
to you?" In that case, it becomes a simple social engineering and
psychology question how to "extract" that information from the
recipient of the message as accurately as possible (assuming you want
to base your measure of information on how much the recipient "knows",
which already implies intelligence on the part of the receiver at the
other end of the channel). Again, it makes no sense to send a message
to a rock that cannot answer.

So the proper question to ask is: "what are you *really* trying to do
here, and why are you using the wrong approach?" Sure, if you make
several arbitrary, made-up restrictions, like "the message cannot be
the Titanic", "asking questions is forbidden", "information is based
on probability as a whole", "you're an idiot because I said so" etc.,
then
things become trivially impossible. But why even make such
restrictions in the first place? If you don't know something, you just
need to ask more questions, and keep asking until you have the answer
with a sufficient precision (which may take a long time). If you
instead just sit there and keep saying "it's impossible for me to find
out this information, because instead of simply asking for it, I am
trying to determine the probability" - then sure, you'll never find out what
you want... But what are you even trying to *do* anyways? What
*real-world* problem are you trying to solve?

Give me your *real-world* problem, and I'll give you a solution or an
approximation. But if your "problem" that you're trying to solve
doesn't make sense outside some theoretical context, then I cannot
help you. In that case, your "problem" doesn't have much to do with
any real-world scenario. On the positive side, at least I can use my
own theory of entropy to do graphics processing, video processing,
image compression, video compression, smart enhancement filters,
motion detection, noise removal and SNR improvement, color
classification, face recognition, various practical information
estimates (sadly, all of which are off-topic here), and also new
musical effects, that I cannot do with "Shannon entropy", which makes
no sense outside some limited contexts. I believe I have also invented
a new type of audio filter with interesting properties that I'm still
investigating, as I've never seen that type of filter described
anywhere in the literature. (I plan to turn this into practical audio
plugins, which is non-trivial as the filter is non-causal, and also
needs to solve the zero-delay feedback problem. BTW, this filter alone
would make a very interesting DSP paper.)
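
For readers who haven't met the zero-delay feedback problem mentioned
above: in a naive digital feedback loop the feedback path is delayed
by one sample, which detunes the response, and the usual cure is to
solve the feedback equation for the current output analytically. Here
is a minimal sketch for a plain one-pole lowpass using the standard
trapezoidal ("TPT") formulation - the textbook case, not the new
filter I mentioned, whose structure I'm not reproducing here:

    // Zero-delay feedback one-pole lowpass (trapezoidal integration sketch).
    // The implicit equation  y = g*(x - y) + s  is solved for y analytically,
    // instead of feeding back the previous sample's output.
    #include <cmath>

    struct ZdfOnePole
    {
        float s = 0.0f;   // integrator state
        float g = 0.0f;   // prewarped cutoff gain

        void set_cutoff(float cutoff_hz, float samplerate_hz)
        {
            const float pi = 3.14159265358979f;
            g = std::tan(pi * cutoff_hz / samplerate_hz);   // bilinear prewarping
        }

        float process(float x)
        {
            float y = (g * x + s) / (1.0f + g);   // solve y = g*(x - y) + s
            s = 2.0f * y - s;                     // update integrator state
            return y;
        }
    };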

But who cares - in the eyes of some arrogant academics who keep their
nose high (because they're afraid to admit that their knowledge is
limited), I'll be considered an idiot no matter how many great new
algorithms or audio filters I invent, because I contradicted their
Bible, and dared to use the highly-overused term 'entropy' to express
something. (Maybe I should have called it "decorrelation coefficient"
instead, to avoid the wrath of a bunch of Shannon-zealots. Anyway, is
there some International Entropy Committee which decides what can be
called "entropy", and what can't? I've never heard of such a thing... I
think
it's one of the most overused technical terms in history.)

So at the end of the day, it doesn't matter how many books you've
read, or even what you *think*. The bottom line is: what have you
*done*? I've shown you what I've *done* in the field of music-DSP, you
heard the audio algorithms that I invented and implemented. As you
see, I'm a person who "gets things done". What have *you* done? Until
you've shown that your work is worthy of respect, you're just talking.

Best regards,
Peter Schoffhauzer
researcher, inventor, author
CEO at Spectralhead Audio
IRC: #music-dsp on EFNet

P.S. For those interested in alternative methods of approximating the
'information' content of an arbitrary message, here's an interesting
neural network experiment I did a few years ago, using standard
feedforward backpropagation learning that you'll find described in any
introductory beginner level artificial intelligence book:

http://scp.web.elte.hu/neuro/
        
As you see, the neural networks could 'extract' much of the
information present in the image fairly well, despite the fact that my
algorithm knew nothing about any kind of *probability*. Depending on
the complexity of the neural network, it gave approximations of the
'information' (=content) with varying levels of detail and precision.
So this is a kind of "information extractor" algorithm, a form of
automated machine learning. Imagine that you can do the same for
audio, which is an area of my research.

Notice that this can also be thought of as a "high-level lossy image
compression", as the image gets represented by a small number of
coefficients in the form of weights of the neurons in the simulated
neural network. This is, in fact, similar to how information is
extracted and stored in your brain in the form of neuronal
connections, as you mentally process sensory input (=information).
This can later be used for, say,
face recognition, as it is like a filter that "resonates" at a certain
input. (If it weren't extremely slow to compute, this could be used
for image compression, and I'm considering trying to speed it up by
combining this with 2D wavelet processing.)
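
For anyone who wants to reproduce the general idea, here is a compact
sketch of the setup as I would reconstruct it - the network size,
activations, learning rate and training schedule used on the linked
page may well differ, so treat these as assumed parameters: train a
small feedforward network with plain backpropagation to map normalized
(x, y) pixel coordinates to a grayscale value, then render the
network's output over the full coordinate grid. The weights then act
as the small, lossy representation of the image.

    // Sketch (assumed setup): fit a tiny MLP  (x, y) -> gray value  to an
    // image with plain stochastic backpropagation; the learned weights act
    // as a small lossy representation of the image.
    #include <cmath>
    #include <cstdlib>

    struct TinyMlp
    {
        static const int H = 16;              // hidden units (assumption)
        float w1[H][2], b1[H], w2[H], b2;     // the "compressed image"

        TinyMlp()
        {
            for (int j = 0; j < H; ++j) {
                b1[j] = 0.0f;
                w2[j] = std::rand() / (float)RAND_MAX - 0.5f;
                for (int i = 0; i < 2; ++i)
                    w1[j][i] = std::rand() / (float)RAND_MAX - 0.5f;
            }
            b2 = 0.0f;
        }

        // Forward pass; optionally records hidden activations for backprop.
        float forward(float x, float y, float* hidden = 0) const
        {
            float out = b2;
            for (int j = 0; j < H; ++j) {
                float h = std::tanh(w1[j][0] * x + w1[j][1] * y + b1[j]);
                if (hidden) hidden[j] = h;
                out += w2[j] * h;
            }
            return 1.0f / (1.0f + std::exp(-out));   // sigmoid: 0..1 gray value
        }

        // One stochastic gradient step on a single pixel (squared error loss).
        void train(float x, float y, float target, float lr)
        {
            float h[H];
            float o = forward(x, y, h);
            float d_out = (o - target) * o * (1.0f - o);           // through sigmoid
            for (int j = 0; j < H; ++j) {
                float d_h = d_out * w2[j] * (1.0f - h[j] * h[j]);  // through tanh
                w2[j]    -= lr * d_out * h[j];
                w1[j][0] -= lr * d_h * x;
                w1[j][1] -= lr * d_h * y;
                b1[j]    -= lr * d_h;
            }
            b2 -= lr * d_out;
        }
    };

    // Usage sketch: img is w*h grayscale bytes, coordinates normalized to 0..1.
    void fit_image(const unsigned char* img, int w, int h, TinyMlp& net)
    {
        for (long step = 0; step < 2000L * w * h; ++step) {
            int x = std::rand() % w, y = std::rand() % h;
            net.train(x / (float)w, y / (float)h, img[y * w + x] / 255.0f, 0.05f);
        }
    }

Rendering forward(x/(float)w, y/(float)h) for every pixel afterwards
gives back the lossy, 'learned' version of the image, with the amount
of recoverable detail limited by the number of hidden units.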

Sure, if you - for whatever reason - insist on defining "information"
based on probability as a *whole*, then this is of no use to you at
all. But what *purpose* would it serve to have a real-valued number
between 0 and 1 assigned to each of these images (assuming they're
each sent in a message)? Say, Image #3 (when sent in a message to
recipient X) had a "Shannon-entropy" of 0.321679. So what? To me, that
sounds
pretty boring and useless, regardless of context.... I think my
approaches are a lot more practical, exciting and fun, than merely
assigning a 0-1 valued number to the whole message, which is again,
pretty boring and of not much use (unless you use that information to
classify messages based on that number). Otherwise, what would you
*do* with that number? After all, even *knowing* it may not be useful
for anything practical at all. So then what are you even trying to
*do* ?

And why stop at the level of messages? The smallest unit of digital
information is a *bit*, not a *message*. All messages in digital
communication consist strictly of *bits*. I certainly remember a
passage from Shannon's "A Mathematical Theory of Communication" where
he discusses the information content of the individual symbols
(letters, words etc.) that constitute the message. If you take the
lowest common denominator of all symbols used in digital
communication, you arrive at *bits*, the tiniest symbol, with a symbol
space of two.
Why skip that part of Shannon's work entirely? Not all symbols in a
message have *equal* entropy... Did you even *read* what Shannon
wrote? Am I the only one on this mailing list who actually *read* this
section from Shannon? Are you guys trying to argue with Shannon
himself? I don't know *where* you got your information from, but
certainly not from Shannon.

If you missed that part where he's discussing the entropy of
individual symbols that constitute a message, you're certainly missing
something from his work... Where did he write that the only unit to
analyze is a *whole message*? He *certainly* analyzes symbolic
constituents and their *correlations*. For example, if I send you the
word "mathematica" in a message, there is some amount of *correlation*
between the symbols (=letters) that constitute that message, due to
the peculiar nature of the English language. This correlation of
symbols decreases the "entropy" of the message as a whole, which was
of interest to Shannon, is of interest to us linguists, and is similar
to what I was talking about earlier ("the more correlated the symbols
are, the less the entropy"). I merely took Shannon's approach, applied
it to the symbols that constitute any digital message (=bits), and
analyzed the *correlation* between bits, similar to how Shannon
analyzed the *correlation* between symbols (like letters) in a
message.
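
To put a rough number on the "mathematica" example (using only
first-order, per-letter statistics of that single word; Shannon's own
estimates for English also account for longer-range correlations): the
word has 11 letters but only 7 distinct ones, so its empirical
per-letter entropy comes out around 2.66 bits, well below the
log2(26) ~ 4.70 bits of uncorrelated, uniformly distributed letters.

    // First-order (per-letter) empirical entropy of a string, in bits per
    // symbol. Repeated, correlated symbols push it below log2(alphabet size);
    // "mathematica" gives roughly 2.66 bits/letter vs. log2(26) ~ 4.70.
    #include <cmath>
    #include <cstdio>

    double letter_entropy(const char* s)
    {
        int count[256] = {0};
        int n = 0;
        for (; *s; ++s, ++n)
            ++count[(unsigned char)*s];

        double h = 0.0;
        for (int c = 0; c < 256; ++c)
            if (count[c] > 0) {
                double p = count[c] / (double)n;
                h -= p * std::log2(p);
            }
        return h;
    }

    int main()
    {
        std::printf("%.3f bits/letter\n", letter_entropy("mathematica"));
        return 0;
    }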

There could be myriad ways of constructing algorithms that
approximate the *correlation* between symbols (= bits, letters, words,
pixels, samples, etc.), which, again, constituted an important part of
Shannon's work in determining the *entropy* of the whole message. Why
skip that part of his work entirely? I have no idea which *book* told
you that the only thing to analyze is "whole messages", but that is
*certainly* not true of Shannon's work. So I'd recommend throwing that
book far away and reading Shannon's work instead, carefully paying
attention to the sections where he discusses the entropy of individual
constituent symbols in a message. (Why are you even reading books that
completely skip half of the story? Those books are to be thrown away
or burned.)

These neural network algorithms didn't exist back in the prime days of
Shannon, and even using today's computers, it took me hours to render
each of those tiny images, as it's extremely computation-intensive to
'extract' or 'learn' information using simulated neural networks and
automated machine learning. (This is one of the reasons why I prefer
simple, unnormalized, bit-arithmetic-based decorrelation analysis
instead, as that's a lot faster to run a billion times, and gives a
perceptually useful result.) If your "message" is a green rectangle
with a red disk in the centre, a simple neural network can
automatically "recognize" and extract that information easily, without
"knowing" anything about your message beforehand; this is routinely
used technology (also in audio processing, speech recognition, etc.).
Without considering how information is perceived, processed and
retrieved in your own brain, and how you can simulate that, one thing
is certain: you'll never get the full picture about "information".

What I am concerned with, is not whether I can 'extract' the
information content of an arbitrary message or not, but rather: how to
do it *faster* and *better*, with less computation and better
precision. How can I recover lost bits of a signal or a piece of
information? How
can I "enhance" the information, and suppress the noise and
low-entropy parts? How can I "compress" the signal by dropping certain
bits? How can I approximate the information using simulated neural
networks? How can I make better quality resamplers, noise reduction
and restoration filters? How can I make a computer "learn" a musical
piece or a musical style? How can I mathematically express what a
human brain considers "art"? How can I make better, faster, more
precise filters? To me, these sound a lot more interesting than merely
"assigning a number between 0 and 1 to each message", which sounds
pretty useless overall, so *why* are you even trying to do *that*?
Without turning it into some *action*, it doesn't make much sense to
do...

Since I'm pragmatic, I do what "works", and I don't need anybody's
approval to do it. Especially not from someone whose
bilinear-transformed biquad filters have more Nyquist warping and
error relative to their analog prototypes than mine do, or from
someone who admittedly has "no clue of entropy estimation". So, I
don't want to disturb your arrogant
religious circles, sorry for any fuss I made. The sad fact is that
your religion is not even "Shannon-religion", rather, some dumbed-down
edition that completely skips a lot of Shannon's work. I believe if
Shannon lived today and joined this discussion, you would even argue
with him, as he was concerned with the *correlation* of *symbols*
constituting a message to determine its entropy (if in doubt - read
his work). If
you missed that part, you missed the point of his theory... (And for
some reason, some of you seem to be missing the physical layer in the
OSI model entirely.) I believe you would tell Shannon: "No, you're an
idiot! Go home and read some beginner books... We *don't* analyze the
correlation of the symbols in a message!!"

So, have fun with your books, I'm rather busy elsewhere, inventing the
future, and turning all my "weird" ideas into audio filters and
plugins.