That is not entirely true. Besides, I wasn't focusing so much on their
"real" research as on the voice-characterization research they had to do
before they could usefully work on recognition. It turns out that the
signal features most necessary for digital voice recognition are the same
ones human brains rely on to recognize and interpret speech. Voice is a
mixed-information-density signal, and if you "simplify" the signal by
filtering out and discarding the less necessary elements, you have
significantly reduced the work the next stage has to do, whether that
stage is digital encoding or speech recognition.
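
A minimal sketch of that kind of pre-filtering, assuming SciPy (the
300-3400 Hz passband is just an illustrative choice, not anyone's
actual design):

    from scipy.signal import butter, sosfilt

    def prefilter(samples, fs=8000.0):
        # Keep only the band that carries most of the speech
        # intelligibility; everything outside it is discarded
        # before the next stage ever sees the signal.
        sos = butter(4, [300.0, 3400.0], btype="bandpass",
                     fs=fs, output="sos")
        return sosfilt(sos, samples)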


On Nov 18, 2007 1:31 PM, Mike Lebo <[EMAIL PROTECTED]> wrote:
>
>  Robert,
>
> I agree. The difference is that speech recognition is not real-time,
> while voice over the radio is.
>
> Mike     n6ief
>
> On Nov 18, 2007 10:46 AM, Robert Thompson <[EMAIL PROTECTED]> wrote:
> >
> > There are several (military/gov) standard intelligibility tests that
> > do a pretty good job of scoring what most humans can and cannot
> > reliably understand. You might try taking a look at them to get some
> > ideas of which voice characteristics make the most difference to
> > intelligibility. There is actually a surprising amount of data out
> > there, especially if you include the data peripheral to the various
> > computerized speech-translator research projects. It's not *exactly*
> > signal processing... but understanding which parts of the signal
> > matter most can be surprisingly helpful. This may be unusually
> > productive, because as yet there hasn't been a huge amount of
> > cross-discipline work between the codec researchers and the
> > speech-to-meaning researchers. While there's a lot of duplicate
> > research in there, it tends to come from slightly different
> > perspectives, and the "stereo view" can sometimes help.
> >
> > On Nov 18, 2007 9:12 AM, Mike Lebo <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi Vojtech,
> > >
> > > Thank you for your reply to my papers. I will do more work on the
> > > phonemes. The project I want to do uses new computers that were not
> > > available 10 years ago. Every 10 ms a decision is made to send a one
> > > or a zero. To make that decision I have 68 parallel FFTs running in
> > > the background. I believe the brain can handle mispronounced words
> > > better than you think.
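> > >
> > > Roughly, the framing looks like this (a minimal sketch in NumPy; the
> > > staggered windows and the threshold here are only placeholders for my
> > > actual decision rule):
> > >
> > >     import numpy as np
> > >
> > >     FS = 8000              # assumed sample rate, Hz
> > >     FRAME = FS // 100      # 10 ms = 80 samples: one decision per frame
> > >     N_FFTS = 68            # parallel analyzers
> > >
> > >     def decide_bits(samples):
> > >         bits = []
> > >         for start in range(0, len(samples) - FRAME + 1, FRAME):
> > >             # N_FFTS FFTs over windows staggered one sample apart,
> > >             # so each decision sees shifted views of the same frame.
> > >             spectra = []
> > >             for k in range(N_FFTS):
> > >                 window = samples[start + k : start + k + FRAME]
> > >                 if len(window) == FRAME:
> > >                     spectra.append(np.abs(np.fft.rfft(window)))
> > >             energy = np.mean([s.max() for s in spectra])
> > >             bits.append(1 if energy > 1.0 else 0)  # placeholder rule
> > >         return bits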
> > >
> > > Mike
> > >
> > >
> > > On Nov 17, 2007 3:55 PM, r_lwesterfield <[EMAIL PROTECTED]> wrote:
> > > >
> > > > I have a few radios (ARC-210-1851, PSC-5D, PRC-117F) at work that
> > > > operate in MELP (Mixed Excitation Linear Prediction) for a vocoder.
> > > > We have found MELP to be superior to LPC-10 (more human-like voice
> > > > qualities, less Charlie Brown's teacher), but we use far larger
> > > > bandwidths than 100 Hz. I do not know how well any of this will play
> > > > out at such a narrow bandwidth. Listening to Charlie Brown's teacher
> > > > will send you running away quickly, and you should think of your
> > > > listeners . . . they will tire very quickly. Just because voice can
> > > > be sent at such narrow bandwidths does not necessarily mean that
> > > > people will like to listen to it.
> > > >
> > > > Rick – KH2DF
> > > >
> > > > ________________________________
> > > >
> > > > From: digitalradio@yahoogroups.com [mailto:[EMAIL PROTECTED]]
> > > > On Behalf Of Vojtech Bubník
> > > > Sent: Saturday, November 17, 2007 9:11 AM
> > > > To: [EMAIL PROTECTED]; digitalradio@yahoogroups.com
> > > > Subject: [digitalradio] Re: digital voice within 100 Hz bandwidth
> > > >
> > > > Hi Mike.
> > > >
> > > > I studied some aspects of voice recognition about 10 years ago,
> > > > when I thought of joining a research group at Czech Technical
> > > > University in Prague. I have a 260-page textbook on voice
> > > > recognition on my bookshelf.
> > > >
> > > > A voice signal has high redundancy compared to a text
> > > > transcription, but there is additional information stored in the
> > > > voice signal, such as pitch, intonation, and speed. One could, for
> > > > example, estimate the mood of the speaker from the utterance.
> > > >
> > > > The vocal tract can be described by a generator (tone for vowels,
> > > > hiss for consonants) and a filter. Translating voice into generator
> > > > and filter coefficients greatly decreases voice data redundancy.
> > > > This is roughly the technique that the common voice codecs use. GSM
> > > > voice compression is a kind of Algebraic Code Excited Linear
> > > > Prediction. Another interesting codec is AMBE (Advanced Multi-Band
> > > > Excitation), used by the DSTAR system. The GSM half-rate codec
> > > > squeezes voice to 5.6 kbit/s, AMBE to 3.6 kbit/s. Both systems use
> > > > excitation tables, but AMBE is more efficient and closed source. I
> > > > think the key to the efficiency is in the size and quality of the
> > > > excitation tables. Creating such an algorithm requires a
> > > > considerable amount of research and data analysis. The
> > > > intelligibility of the GSM and AMBE codecs is very good. You can buy
> > > > the intellectual property of the AMBE codec by buying the chip.
> > > > There are a couple of projects trying to build DSTAR into legacy
> > > > transceivers.
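> > > >
> > > > To illustrate the source-filter idea, here is a minimal sketch of
> > > > extracting filter coefficients from one windowed speech frame by
> > > > linear prediction (plain autocorrelation plus Levinson-Durbin,
> > > > assuming NumPy; real codecs add excitation modeling and quantization
> > > > on top of this):
> > > >
> > > >     import numpy as np
> > > >
> > > >     def lpc_coefficients(frame, order=10):
> > > >         # Autocorrelation of the frame, lags 0..order.
> > > >         r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
> > > >                       for k in range(order + 1)])
> > > >         # Levinson-Durbin recursion solves the Toeplitz system for
> > > >         # the all-pole vocal-tract filter coefficients a[1..order].
> > > >         a = np.zeros(order + 1)
> > > >         a[0] = 1.0
> > > >         err = r[0]
> > > >         for i in range(1, order + 1):
> > > >             k = -(r[i] + np.dot(a[1:i], r[i-1:0:-1])) / err
> > > >             a[1:i+1] = a[1:i+1] + k * a[i-1::-1][:i]
> > > >             err *= (1.0 - k * k)
> > > >         # Filter coefficients plus residual (excitation) energy.
> > > >         return a, err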
> > > >
> > > > About 10 years ago, we at the OK1KPI club experimented with an
> > > > EchoLink-like system. We modified the speakfreely software to
> > > > control an FM transceiver, and we added a web interface to control
> > > > the tuning and subtone of the transceiver. It was a lot of fun and a
> > > > unique system at that time. http://www.speakfreely.org/ The best
> > > > compression factor is offered by the LPC-10 codec (3460 bit/s), but
> > > > the sound is very robot-like and quite hard to understand. In the
> > > > end we reverted to GSM. I think IVOX is a variant of the LPC system
> > > > that we tried.
> > > >
> > > > Your proposal is to increase the compression rate by transmitting
> > > > phonemes. I once had the same idea, but I quickly rejected it.
> > > > Although it may be a nice exercise, I find it not very useful until
> > > > good continuous-speech multi-speaker multi-language recognition
> > > > systems are available. I will try to explain my reasoning behind
> > > > that statement.
> > > >
> > > > Let's classify voice recognition systems by implementation
> > > > complexity:
> > > > 1) single-speaker, limited set of utterances recognized (control
> > > > your desktop by voice)
> > > > 2) multiple-speaker, limited set of utterances recognized (automated
> > > > phone system)
> > > > 3) dictating system
> > > > 4) continuous speech transcription
> > > > 5) speech recognition and understanding
> > > >
> > > > Your proposal would need to implement most of the code from 4) or
> > > > 5) to be really usable, and it has to be reliable.
> > > >
> > > > State-of-the-art voice recognition systems use hidden Markov models
> > > > to detect phonemes. A phoneme is found by traversing a state diagram
> > > > while evaluating multiple recorded spectra. The phoneme is
> > > > soft-decoded: the output of the classifier is a list of phonemes
> > > > with their detection probabilities assigned. To cope with phoneme
> > > > smearing at the boundaries, either sub-phonemes or phoneme pairs
> > > > need to be detected.
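> > > >
> > > > As a toy illustration of the soft-decoding step, here is a minimal
> > > > forward-algorithm sketch in NumPy (the two-phoneme model and all of
> > > > its probabilities are invented for the example):
> > > >
> > > >     import numpy as np
> > > >
> > > >     # Toy HMM: two phoneme states; observations are quantized
> > > >     # spectra 0/1/2. All numbers are made up.
> > > >     trans = np.array([[0.7, 0.3],
> > > >                       [0.4, 0.6]])      # state transitions
> > > >     emit = np.array([[0.6, 0.3, 0.1],
> > > >                      [0.1, 0.3, 0.6]])  # P(obs | state)
> > > >     prior = np.array([0.5, 0.5])
> > > >
> > > >     def soft_decode(observations):
> > > >         # Forward algorithm, normalized per frame, so each row is
> > > >         # a list of phoneme probabilities: a soft decision.
> > > >         alpha = prior * emit[:, observations[0]]
> > > >         alpha /= alpha.sum()
> > > >         posteriors = [alpha]
> > > >         for obs in observations[1:]:
> > > >             alpha = (alpha @ trans) * emit[:, obs]
> > > >             alpha /= alpha.sum()
> > > >             posteriors.append(alpha)
> > > >         return np.array(posteriors)
> > > >
> > > >     print(soft_decode([0, 0, 2, 2, 1]))  # probabilities per frame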
> > > >
> > > > After the phonemes are classified, they are chained into words, and
> > > > the most probable words are picked depending on the dictionary. You
> > > > suppose that your system will not need this. But the trouble is the
> > > > consonants: they carry much less energy than vowels and are much
> > > > easier to confuse. The dictionary is what lets you fall back on a
> > > > second-highest-probability consonant detected in the word. Not only
> > > > the dictionary, but also the phoneme classifier is language
> > > > dependent.
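> > > >
> > > > A crude sketch of that dictionary step (plain Python; the three-word
> > > > dictionary and all the probabilities are invented for the example):
> > > >
> > > >     # Per-position phoneme probabilities from the classifier.
> > > >     # The weak consonants are rescued by the dictionary below.
> > > >     frame_probs = [
> > > >         {"s": 0.45, "f": 0.40, "h": 0.15},
> > > >         {"i": 0.90, "e": 0.10},
> > > >         {"t": 0.50, "d": 0.30, "k": 0.20},
> > > >     ]
> > > >
> > > >     dictionary = ["sit", "fit", "fed"]  # toy phonetic dictionary
> > > >
> > > >     def word_score(word):
> > > >         # Product of the classifier's probabilities per phoneme.
> > > >         score = 1.0
> > > >         for pos, phoneme in enumerate(word):
> > > >             score *= frame_probs[pos].get(phoneme, 0.0)
> > > >         return score
> > > >
> > > >     best = max(dictionary, key=word_score)
> > > >     print(best, word_score(best))  # -> sit 0.2025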
> > > >
> > > > I think the human brain works in the same way. Imagine learning a
> > > > foreign language: even if you are able to recognize slowly
> > > > pronounced words, you will be unable to pick them out of a quickly
> > > > spoken sentence. The words will sound different. A human needs
> > > > considerable training to understand a language. You could decrease
> > > > the complexity of the decoder by constraining detection to slowly
> > > > dictated, separate words.
> > > >
> > > > If you simply pick the highest-probability phoneme, you will
> > > > reproduce the comprehension problems of people with hearing loss.
> > > > Oh yes, I am currently working for a hearing instrument manufacturer
> > > > (I have nothing to do with merck.com).
> > > >
> > > > from http://www.merck.com/mmhe/sec19/ch218/ch218a.html
> > > > > Loss of the ability to hear high-pitched sounds often makes it
> > > > > more difficult to understand speech. Although the loudness of
> > > > > speech appears normal to the person, certain consonant sounds
> > > > > (such as the sound of the letters C, D, K, P, S, and T) become
> > > > > hard to distinguish, so that many people with hearing loss think
> > > > > the speaker is mumbling. Words can be misinterpreted. For example,
> > > > > a person may hear "bone" when the speaker said "stone."
> > > >
> > > > For me, it would be very irritating to dictate slowly to a system
> > > > knowing it will add some mumbling, without even having feedback
> > > > about the errors the recognizer makes. From my perspective, until
> > > > good voice recognition systems exist, it is reasonable to stick to
> > > > the keyboard for extremely low bit rates. If you would like to
> > > > experiment, there are a lot of open-source voice recognition
> > > > packages. I am sure you could hack one to output the most probable
> > > > phoneme detected, and you could then try for yourself whether the
> > > > result is intelligible. You do not need the sound-generating system
> > > > for that experiment; it is quite easy to read the written phonemes.
> > > > After you have a good phoneme detector, the rest of your proposed
> > > > software package is a piece of cake.
> > > >
> > > > I am afraid I will disappoint you. I do not mean to disparage your
> > > > work; I found a couple of nice ideas in your text. I like the idea
> > > > of setting up the varicode table so that similar-sounding phonemes
> > > > get neighboring codes, and of coding phoneme length by filling gaps
> > > > in the data stream with a special code, as sketched below. But I
> > > > would suggest you read a textbook on voice recognition so as not to
> > > > reinvent the wheel.
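> > > >
> > > > A toy sketch of that table idea (plain Python; the phoneme ordering
> > > > and the codes are invented): similar-sounding phonemes sit next to
> > > > each other, so a small decoding error substitutes a similar sound,
> > > > and a repeated phoneme is sent as a special "gap" code.
> > > >
> > > >     # Phonemes ordered so acoustically similar ones are adjacent
> > > >     # (this ordering is invented for illustration).
> > > >     phonemes = ["p", "b", "t", "d", "k", "g", "f", "v", "s", "z"]
> > > >
> > > >     # Neighboring indices -> similar sounds, so a +/-1 decoding
> > > >     # error lands on a similar phoneme.
> > > >     varicode = {ph: i for i, ph in enumerate(phonemes)}
> > > >
> > > >     GAP = len(phonemes)  # special code: previous phoneme continues
> > > >
> > > >     def encode(phoneme_frames):
> > > >         # One code per frame; repeats become the gap code, which
> > > >         # is how phoneme length gets carried in the stream.
> > > >         out, prev = [], None
> > > >         for ph in phoneme_frames:
> > > >             out.append(GAP if ph == prev else varicode[ph])
> > > >             prev = ph
> > > >         return out
> > > >
> > > >     print(encode(["s", "s", "s", "t", "t"]))  # [8, 10, 10, 2, 10]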
> > > >
> > > > 73 and GL, Vojtech OK1IAK



-- 

Regards, Robert Thompson

====================================================
~   Concise, Complete, Correct: Pick Two
~   Faster, Cheaper, Better: Pick Two
~   Pervasive, Powerful, Trustworthy: Pick One
~        "Whom the computers would destroy, they first drive mad."
~                            -- Anonymous
====================================================
