Re: [digitalradio] Re: digital voice within 100 Hz bandwidth

Robert Thompson Sun, 18 Nov 2007 12:14:52 -0800

Oops, sent too quickly. What I meant was: "That (speech recognition
not being real time) is not entirely true." There are many commercial
packages that do minimal-lag "realtime" speech recognition. One
example would be the voice command features built into Apple's OS X.
Another would be any one of a number of speech-to-text transcription
packages.


I apologize if my unsupported and abrupt original phrasing appeared to
be inflammatory. Such was not intended.


On Nov 18, 2007 2:11 PM, Robert Thompson <[EMAIL PROTECTED]> wrote:
> That is not entirely true. Besides, I wasn't focusing so much on their
> "real" research as the voice characterization research that they had
> to do before they could usefully work on recognition. It turns out
> that the very areas that are most necessary for digital voice
> recognition are the ones most necessary for human brains to recognize
> and interpret. Voice is a mixed-information-density signal, and if you
> "simplify" the signal by filtering out and discarding the less
> necessary elements, you have significantly reduced the effort the next
> stage has to do, whether it's digital encoding or speech recognition.
>
>
>
> On Nov 18, 2007 1:31 PM, Mike Lebo <[EMAIL PROTECTED]> wrote:
> >
> >  Robert,
> >
> > I agree. The thing that is different is that speech recognition is not real
> > time. Voice over the radio is real time.
> >
> > Mike     n6ief
> >
> >
> >
> > On Nov 18, 2007 10:46 AM, Robert Thompson < [EMAIL PROTECTED]>
> > wrote:
> > >
> > > There are several (military/gov) standard intelligibility tests that
> > > do a pretty good job of scoring what most humans can and can not
> > > reliably understand. You might try taking a look at them to get some
> > > ideas of which voice characteristics make the most difference to
> > > intelligibility. There is actually a surprising amount of data out
> > > there, especially if you include the data peripheral to the various
> > > computerized speech translator research projects. It's not *exactly*
> > > signal processing... but understanding what parts of the signal matter
> > > the most can be surprisingly helpful. This may be unusually
> > > productive, because as of yet there hasn't been a huge amount of
> > > cross-discipline work between the codec researchers and the
> > > speech-to-meaning researchers. While there's a lot of duplicate
> > > research in there, it tends to be from slightly different
> > > perspectives, and the "stereo view" can sometimes help.
> > >
> > >
> > >
> > >
> > > On Nov 18, 2007 9:12 AM, Mike Lebo <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi Vojtech,
> > > >
> > > > Thank you for your reply to my papers. I will do more work on the
> > > > phonemes. The project I want to do uses new computers that were not
> > > > available 10 years ago. Every 10 ms a decision is made to send a one
> > > > or a zero. To make that decision I have 68 parallel FFTs running in
> > > > the background. I believe the brain could handle mispronounced words
> > > > better than you think.
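
As a quick check on the numbers in the paragraph above: one binary
decision every 10 ms works out to 100 bit/s. The sketch below only does
that arithmetic and compares it with the codec rates quoted further down
the thread. The function and variable names are made up for
illustration; nothing here models the 68 FFTs or the decision logic.

```python
# One binary decision every 10 ms, as stated in the quoted message.
DECISION_INTERVAL_S = 0.010
BITS_PER_DECISION = 1


def bit_rate(interval_s: float, bits_per_decision: int = 1) -> float:
    """Raw bit rate in bit/s for a fixed decision interval."""
    return bits_per_decision / interval_s


proposed_bps = bit_rate(DECISION_INTERVAL_S, BITS_PER_DECISION)
print(f"proposed scheme: {proposed_bps:.0f} bit/s")

# Codec rates as quoted elsewhere in this thread, in bit/s.
quoted_rates_bps = {"GSM half-rate": 5600, "AMBE (DSTAR)": 3600}

for name, rate in quoted_rates_bps.items():
    ratio = rate / proposed_bps
    print(f"{name}: {rate} bit/s, {ratio:.0f}x the proposed rate")
```

So the proposal has to get by on a few dozen times fewer bits than the
codecs mentioned in this thread, which is why it ends up looking at
phonemes rather than waveform parameters.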
> > > >
> > > > Mike
> > > >
> > > >
> > > > On Nov 17, 2007 3:55 PM, r_lwesterfield <[EMAIL PROTECTED]>
> > > > wrote:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > I have a few radios (ARC-210-1851, PSC-5D, PRC-117F) at work that
> > > > > > use MELP (Mixed Excitation Linear Prediction) as a vocoder. We have
> > > > > > found MELP to be superior to LPC-10 (more human-like voice
> > > > > > qualities, less Charlie Brown's teacher), but we use far larger
> > > > > > bandwidths than 100 Hz. I do not know how well any of this will play
> > > > > > out at such a narrow bandwidth. Listening to Charlie Brown's teacher
> > > > > > will send you running away quickly, and you should think of your
> > > > > > listeners . . . they will tire very quickly. Just because voice can
> > > > > > be sent at such narrow bandwidths does not necessarily mean that
> > > > > > people will like to listen to it.
> > > > >
> > > > >
> > > > >
> > > > > Rick – KH2DF
> > > > >
> > > > >
> > > > >
> > > > > ________________________________
> > > >
> > > > >
> > > > > > From: digitalradio@yahoogroups.com [mailto:[EMAIL PROTECTED] On Behalf Of Vojtech Bubník
> > > > > Sent: Saturday, November 17, 2007 9:11 AM
> > > > > To: [EMAIL PROTECTED]; digitalradio@yahoogroups.com
> > > > > Subject: [digitalradio] Re: digital voice within 100 Hz bandwidth
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Hi Mike.
> > > > >
> > > > > > I studied some aspects of voice recognition about 10 years ago, when
> > > > > > I thought of joining a research group at the Czech Technical
> > > > > > University in Prague. I have a 260-page textbook on voice
> > > > > > recognition on my bookshelf.
> > > > >
> > > > > > A voice signal has high redundancy compared to a text
> > > > > > transcription. But there is additional information stored in the
> > > > > > voice signal, such as pitch, intonation, and speed. One could, for
> > > > > > example, estimate the mood of the speaker from the utterance.
> > > > >
> > > > > > The vocal tract can be described by a generator (tone for vowels,
> > > > > > hiss for consonants) and a filter. Translating voice into generator
> > > > > > and filter coefficients greatly decreases voice data redundancy.
> > > > > > This is roughly what the common voice codecs do. GSM voice
> > > > > > compression is a kind of Algebraic Code Excited Linear Prediction.
> > > > > > Another interesting codec is AMBE (Advanced Multi-Band Excitation),
> > > > > > used by the DSTAR system. The GSM half-rate codec squeezes voice to
> > > > > > 5.6 kbit/s, AMBE to 3.6 kbit/s. Both systems use excitation tables,
> > > > > > but AMBE is more efficient and closed source. I think the key to
> > > > > > the efficiency is in the size and quality of the excitation tables.
> > > > > > Creating such an algorithm requires a considerable amount of
> > > > > > research and data analysis. The intelligibility of the GSM and AMBE
> > > > > > codecs is very good. You can buy the intellectual property of the
> > > > > > AMBE codec by buying the chip. There are a couple of projects under
> > > > > > way trying to build DSTAR into legacy transceivers.
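
To make the generator-plus-filter description above concrete, here is a
minimal sketch of the idea, not of any real codec. It assumes an 8 kHz
sample rate, a single two-pole resonator standing in for the vocal-tract
filter, and made-up pitch and formant values. GSM and AMBE are far more
elaborate, and all names below are hypothetical.

```python
import math
import random

SAMPLE_RATE = 8000  # Hz; assumed telephone-band rate, not from the thread


def excitation(voiced, n, pitch_hz=100.0, seed=0):
    """Generator: impulse train (tone) if voiced, white noise (hiss) if not."""
    if voiced:
        period = int(SAMPLE_RATE / pitch_hz)
        return [1.0 if i % period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def resonator_coeffs(freq_hz, bandwidth_hz):
    """Coefficients [a1, a2] of the all-pole filter 1 / (1 + a1 z^-1 + a2 z^-2)."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)  # pole radius < 1
    theta = 2.0 * math.pi * freq_hz / SAMPLE_RATE        # pole angle
    return [-2.0 * r * math.cos(theta), r * r]


def all_pole_filter(x, a):
    """Apply y[n] = x[n] - sum_k a[k] * y[n-1-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y


FRAME = 160  # samples, i.e. 20 ms at 8 kHz

# A vowel-like frame: periodic excitation through a low resonance.
vowel = all_pole_filter(excitation(True, FRAME), resonator_coeffs(700.0, 100.0))
# A consonant-like frame: noise excitation through a higher, wider resonance.
hiss = all_pole_filter(excitation(False, FRAME), resonator_coeffs(3000.0, 500.0))

# Per frame, a codec of this kind sends a few parameters instead of samples:
# voiced flag, pitch, and the filter coefficients.
params_per_frame = 1 + 1 + 2
print(f"{FRAME} samples per frame vs {params_per_frame} parameters per frame")
```

The redundancy reduction comes from the last line: the receiver
regenerates the frame from a handful of parameters rather than from the
samples themselves.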
> > > > >
> > > > > > About 10 years ago we at the OK1KPI club experimented with an
> > > > > > EchoLink-like system. We modified the Speak Freely software to
> > > > > > control an FM transceiver, and we added a web interface to control
> > > > > > the tuning and subtone of the transceiver. It was a lot of fun and
> > > > > > a unique system at that time. http://www.speakfreely.org/ The
> > > > > > LPC-10 codec offers the best compression factor (3460 bps), but the
> > > > > > sound is very robot-like and quite hard to understand. In the end
> > > > > > we reverted to GSM. I think IVOX is a variant of the LPC system
> > > > > > that we tried.
> > > > >
> > > > > > Your proposal is to increase the compression rate by transmitting
> > > > > > phonemes. I once had the same idea, but I quickly rejected it.
> > > > > > Although it may be a nice exercise, I do not find it very useful
> > > > > > until good continuous-speech, multi-speaker, multi-language
> > > > > > recognition systems are available. I will try to explain my
> > > > > > reasoning behind that statement.
> > > > >
> > > > > > Let's classify voice recognition systems by implementation
> > > > > > complexity:
> > > > > > 1) Single-speaker, limited set of utterances recognized (control
> > > > > >    your desktop by voice)
> > > > > > 2) Multiple-speaker, limited set of utterances recognized
> > > > > >    (automated phone system)
> > > > > > 3) Dictation system
> > > > > > 4) Continuous speech transcription
> > > > > > 5) Speech recognition and understanding
> > > > >
> > > > > > Your proposal will need to implement most of the code from 4) or 5)
> > > > > > to be really usable, and it has to be reliable.
> > > > >
> > > > > > State-of-the-art voice recognition systems use hidden Markov models
> > > > > > to detect phonemes. A phoneme is found by traversing a state
> > > > > > diagram while evaluating multiple recorded spectra. The phoneme is
> > > > > > soft-decoded: the output of the classifier is a list of phonemes
> > > > > > with their detection probabilities. To cope with phonemes smearing
> > > > > > at their boundaries, either sub-phonemes or phoneme pairs need to
> > > > > > be detected.
> > > > >
> > > > > > After the phonemes are classified, they are chained into words.
> > > > > > The most probable words are picked with the help of a dictionary.
> > > > > > You suppose that your system will not need this. But the trouble is
> > > > > > consonants. They carry much less energy than vowels and are much
> > > > > > more easily confused. The dictionary is what allows a consonant
> > > > > > detected with only the second-highest probability to be picked when
> > > > > > it yields a valid word. Not only the dictionary but also the
> > > > > > phoneme classifier is language dependent.
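
The role of the dictionary described above can be shown with a toy
example. This is only a sketch: the per-position probabilities, the
phoneme spellings, and the four-word dictionary are invented for
illustration. A real recognizer would score HMM state sequences rather
than multiply independent per-position probabilities.

```python
# Hypothetical soft-decoder output for one three-phoneme word:
# for each position, candidate phonemes with their detection probability.
lattice = [
    {"t": 0.5, "k": 0.4, "p": 0.1},   # weak consonant, easily confused
    {"ae": 0.9, "e": 0.1},            # strong vowel, detected reliably
    {"t": 0.6, "p": 0.4},
]

# Hypothetical pronunciation dictionary: word -> phoneme sequence.
dictionary = {
    "cat": ("k", "ae", "t"),
    "cap": ("k", "ae", "p"),
    "pat": ("p", "ae", "t"),
    "tap": ("t", "ae", "p"),
}


def greedy(lattice):
    """Pick the single most probable phoneme at every position."""
    return tuple(max(frame, key=frame.get) for frame in lattice)


def word_score(phonemes, lattice):
    """Product of the detection probabilities along one dictionary entry."""
    if len(phonemes) != len(lattice):
        return 0.0
    score = 1.0
    for phoneme, frame in zip(phonemes, lattice):
        score *= frame.get(phoneme, 0.0)
    return score


def best_word(lattice, dictionary):
    """Pick the dictionary word with the highest score."""
    return max(dictionary, key=lambda w: word_score(dictionary[w], lattice))


print("top-1 phonemes: ", greedy(lattice))
print("dictionary pick:", best_word(lattice, dictionary))
```

With these made-up numbers the top-1 phoneme string is "t ae t", which
is not a word in the dictionary, while the dictionary search settles on
"cat" by accepting the second-ranked "k" in the first position. Sending
only the top-1 phonemes over the air throws that correction away.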
> > > > >
> > > > > > I think the human brain works in the same way. Imagine learning a
> > > > > > foreign language. Even if you are able to recognize slowly
> > > > > > pronounced words, you will be unable to pick them out of a quickly
> > > > > > spoken sentence. The words will sound different. A human needs
> > > > > > considerable training to understand a language. You could decrease
> > > > > > the complexity of the decoder by constraining detection to slowly
> > > > > > dictated separate words.
> > > > >
> > > > > > If you simply pick the highest-probability phoneme, you will
> > > > > > experience the comprehension problems of people with hearing loss.
> > > > > > Oh yes, I am currently working for a hearing instrument
> > > > > > manufacturer (I have nothing to do with merck.com).
> > > > >
> > > > > from http://www.merck.com/mmhe/sec19/ch218/ch218a.html
> > > > > > Loss of the ability to hear high-pitched sounds often makes it more
> > > > difficult to understand speech. Although the loudness of speech appears
> > > > normal to the person, certain consonant sounds—such as the sound of
> > letters
> > > > C, D, K, P, S, and T—become hard to distinguish, so that many people
> > with
> > > > hearing loss think the speaker is mumbling. Words can be misinterpreted.
> > For
> > > > example, a person may hear "bone" when the speaker said "stone."
> > > > >
> > > > > > For me, it would be very irritating to dictate slowly to a system
> > > > > > knowing that it will add some mumbling, without even getting
> > > > > > feedback about the errors the recognizer makes. From my
> > > > > > perspective, until good voice recognition systems are available, it
> > > > > > is reasonable to stick to the keyboard for extremely low bit rates.
> > > > > > If you would like to experiment, there are a lot of open source
> > > > > > voice recognition packages. I am sure you could hack one to output
> > > > > > the most probable phoneme detected, and you could try for yourself
> > > > > > whether the result is intelligible or not. You do not need the
> > > > > > sound-generating system for that experiment; it is quite easy to
> > > > > > read the written phonemes. After you have a good phoneme detector,
> > > > > > the rest of your proposed software package is a piece of cake.
> > > > >
> > > > > > I am afraid I will disappoint you. I do not look down on your work.
> > > > > > I found a couple of nice ideas in your text. I like the idea of
> > > > > > setting up the varicode table to code similar-sounding phonemes
> > > > > > with neighboring codes, and of coding phoneme length by filling
> > > > > > gaps in the data stream with a special code. But I would suggest
> > > > > > that you read a textbook on voice recognition so as not to reinvent
> > > > > > the wheel.
> > > > >
> > > > > 73 and GL, Vojtech OK1IAK
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Regards, Robert Thompson
> > >
> > > ====================================================
> > > ~ Concise, Complete, Correct: Pick Two
> > > ~ Faster, Cheaper, Better: Pick Two
> > > ~ Pervasive, Powerful, Trustworthy: Pick One
> > > ~ "Whom the computers would destroy, they first drive mad."
> > > ~ -- Anonymous
> > > ====================================================
> > >
> > >
> >
> >  
>
>
>
>



-- 

Regards, Robert Thompson

====================================================
~   Concise, Complete, Correct: Pick Two
~   Faster, Cheaper, Better: Pick Two
~   Pervasive, Powerful, Trustworthy: Pick One
~        "Whom the computers would destroy, they first drive mad."
~                            -- Anonymous
====================================================
