Great talk, Charlie, thanks for coming and sharing your passion for this
stuff. I'm about as unmusical as any human, which is still pretty musical,
so it's great to learn from someone who's been thinking about this so much.


Here's a kind of outline for a campaign to do the MIDI analysis. Let me
know if this makes sense. I have some ideas about the audio to MIDI, but
they need this working first.

From both a theory and a practice point of view, we're concerned with what
work needs to be done subcortically (the encoder) and what can be learned
cortically (in the HTM). Nature will hopefully have made similar decisions,
based on a similar tradeoff. Encoders have to be evolved (or programmed in
our case), which is very expensive compared to learning in cortex (or just
shovelling data into HTM). But you can't just shove raw data into HTM,
because it needs some semantic encoding to extract structure from. You also
aren't going to persuade Nature to evolve encoders for music in
anticipation of it, so there's a strict upper bound on the hand-engineering
of encoders.

The analogy with vision is important. It suggests we need to have a
hierarchy (including the encoders) which at the bottom is topologically
mapped to the midi channels (which are a proxy for cochlear frequency
bands). To first order, let's ignore velocity values and just pretend it's
a binary stream. To simplify further (I believe your preprocessor does
this), we should limit temporal resolution to fixed bins, eliminating very
fine differences in onsets and offsets across channels; the binned stream
will resemble the composition more closely than the performance.
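To make the binning concrete, here's a minimal sketch. It assumes the MIDI
has already been flattened to (time_seconds, pitch, is_onset) tuples; the
function name and bin size are just illustrative, not from any real
preprocessor.

```python
# Quantise note events into fixed time bins. Onsets/offsets that differ
# by less than bin_size collapse into the same bin, discarding the fine
# performance timing mentioned above.

def binned_states(events, bin_size=0.05):
    """events: list of (time_seconds, pitch, is_onset) tuples.
    Returns a list of per-bin sets of sounding pitches."""
    if not events:
        return []
    n_bins = int(max(t for t, _, _ in events) / bin_size) + 1
    sounding = [set() for _ in range(n_bins)]
    active = {}  # pitch -> bin of its onset
    for t, pitch, is_onset in sorted(events):
        b = int(t / bin_size)
        if is_onset:
            active[pitch] = b
        elif pitch in active:
            for i in range(active.pop(pitch), b + 1):
                sounding[i].add(pitch)
    return sounding

events = [(0.00, 60, True), (0.01, 64, True),   # near-simultaneous onsets
          (0.50, 60, False), (0.52, 64, False)]
bins = binned_states(events)
print(sorted(bins[0]))  # [60, 64] - both onsets land in the same bin
```

The 10ms spread between the two onsets disappears at a 50ms bin size, which
is exactly the performance-vs-composition simplification we want.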

OK, now the musical information is contained in the position and timing of
onsets and (less importantly and usefully) offsets. Assuming we're
streaming, we won't know when the offsets are until they happen. So a
simple encoder would have a bit per channel per non-off state: onset, on,
offset. We usually give these each a small width so they're potentially
seen by lots of columns, let's say 8 replicated bits per channel per state.
We can reduce this to 4 or 2 at the extremes if they're seldom used in real
MIDI files. So the encoding would be, say, 64 * 3 * 8 + 32 * 3 * 4 + 32 * 3
* 2 = 1536 + 384 + 192 = 2112 bits.

You'd then put a topological HTM layer (L4) on top of this, with each
column seeing inputs from an octave up and down, say. The SDRs this would
produce after learning would include localised representations of
individual chords in each stage of their existence, so the columns for the
middle C-major chord onset will appear when that chord is first played,
followed by columns for middle C-major "on", then middle C-major "off", and
then none of the above. A temporal pooling layer (L2/3) above this will
show "middle C-major" during the on-time of the chord, and can learn that
melody using sequence memory.
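Just to pin down the topological wiring: here's a sketch of how each L4
column's potential pool might map onto the encoder bits for pitches an
octave up and down. For simplicity it assumes a uniform bit width per
channel (the mixed-width encoder above would need an offset table); this is
illustrative wiring, not any particular HTM library's API.

```python
# Octave-span receptive fields: column centred on a channel sees the
# encoder bits for channels within +/-12 semitones, clipped at the ends.

BITS_PER_CHANNEL = 3 * 8    # 3 states x replication width 8
N_CHANNELS = 128

def potential_pool(column_channel, octave=12):
    lo = max(0, column_channel - octave)
    hi = min(N_CHANNELS - 1, column_channel + octave)
    return range(lo * BITS_PER_CHANNEL, (hi + 1) * BITS_PER_CHANNEL)

pool = potential_pool(60)
print(len(pool))  # 25 channels x 24 bits = 600 input bits per column
```

A column centred on middle C then sees every state bit for every pitch in a
two-octave window, which is what lets it learn localised chord
representations.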

To track multiple voices, you may be able to use this one region. The
sequences of chords in L2/3 might be predictable enough to hold together in
parallel, but you'll still get a union of their SDRs. Or you might need
another region on top to separate the voices.

You can do key identification simply by counting up the L2/3 chords (per
voice) used in recent bars of the music. The chords which occur over a
short period give information about the probable key. Some of the L2/3
columns will automatically indicate this because they happen to be
temporally pooling over exactly the set of chords for one key. A classifier
or higher region trained to predict just the key (over many melodies, each
in the same key) will give you the key.
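The chord-counting version of this can be sketched in a few lines. The
diatonic table and scoring are a deliberate simplification (roots only,
chord quality ignored); it's just meant to show why counting recent chords
is enough to point at a key.

```python
# Tally recent chord roots and score each major key by how many of them
# are diatonic to it; highest score wins.

from collections import Counter

NOTE = "C C# D D# E F F# G G# A A# B".split()

def diatonic_roots(tonic):
    """Roots of the triads on the degrees of a major key."""
    steps = [0, 2, 4, 5, 7, 9, 11]
    return {NOTE[(tonic + s) % 12] for s in steps}

def likely_key(recent_chords):
    counts = Counter(recent_chords)
    scores = {NOTE[t]: sum(n for root, n in counts.items()
                           if root in diatonic_roots(t))
              for t in range(12)}
    return max(scores, key=scores.get)

print(likely_key(["C", "F", "G", "B"]))  # prints C
```

F and B together nail it down: a major scale contains exactly one tritone,
so only C major holds all four of those roots.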

To identify a melody, you'd need to add another region above this analysis
level, which would take chord SDRs from lower region L2/3 into L4 and
temporally pool over them to produce melody SDRs.

To learn composer, style or mood, you'd provide the "melody identifier"
region with encodings of the composer, style or mood as L1 input along with
the melody data from this level. You'd feed the output of the higher region
in as L1 predictive input to the L2/3 layer and have all regions learn
together. This would potentially give you a generative model which might
reproduce Bach- or Beethoven-style music.


Regards,

Fergal Byrne


On Sat, Jun 13, 2015 at 8:19 PM, Matthew Lohbihler <[email protected]> wrote:

>  Perhaps it's the opposite, and explains tinnitus. Being someone who has
> this, I can confirm that the pitch is precisely constant.
>
>
> On 6/13/2015 2:34 PM, Tim Boudreau wrote:
>
> Interesting idea, but for it to be correct, shouldn't we have observed
> cases of highly pitch-specific deafness - i.e. you lose 435-445Hz, but 446
> is fine?
>
>  -Tim
>
> On Sat, Jun 13, 2015 at 2:30 AM, Matthew Taylor <[email protected]> wrote:
>
>> Interesting reading:
>> http://hyperphysics.phy-astr.gsu.edu/hbase/sound/place.html
>>
>> Sent from my MegaPhone
>>
>
>
>
>  --
>  http://timboudreau.com
>
>
>


-- 

Fergal Byrne, Brenter IT @fergbyrne

http://inbits.com - Better Living through Thoughtful Technology
http://ie.linkedin.com/in/fergbyrne/ - https://github.com/fergalbyrne

Founder of Clortex: HTM in Clojure -
https://github.com/nupic-community/clortex
Co-creator @OccupyStartups Time-Bombed Open License http://occupystartups.me

Author, Real Machine Intelligence with Clortex and NuPIC
Read for free or buy the book at https://leanpub.com/realsmartmachines

e:[email protected] t:+353 83 4214179
Join the quest for Machine Intelligence at http://numenta.org
Formerly of Adnet [email protected] http://www.adnet.ie
