Robert,

I agree. The difference is that speech recognition does not have to run
in real time, but voice over the radio does.

Mike     n6ief

On Nov 18, 2007 10:46 AM, Robert Thompson <[EMAIL PROTECTED]>
wrote:

>   There are several (military/gov) standard intelligibility tests that
> do a pretty good job of scoring what most humans can and cannot
> reliably understand. You might try taking a look at them to get some
> ideas of which voice characteristics make the most difference to
> intelligibility. There is actually a surprising amount of data out
> there, especially if you include the data peripheral to the various
> computerized speech translator research projects. It's not *exactly*
> signal processing... but understanding which parts of the signal matter
> most can be surprisingly helpful. This may be unusually productive,
> because as yet there hasn't been a huge amount of cross-discipline
> work between the codec researchers and the speech-to-meaning
> researchers. While there's a lot of duplicate research in there, it
> tends to be from slightly different perspectives, and the "stereo
> view" can sometimes help.
>
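> As a concrete example, the Diagnostic Rhyme Test scores a two-choice
> rhyming-word test with a correction for guessing. A minimal sketch of
> that arithmetic (Python; the response counts are invented):
>
>     def drt_score(right, wrong):
>         """Diagnostic Rhyme Test score, corrected for guessing.
>
>         Each trial is a forced choice between two rhyming words, so
>         a listener guessing at random gets 50% right; the
>         (right - wrong) form maps chance to 0 and perfection to 100.
>         """
>         total = right + wrong
>         return 100.0 * (right - wrong) / total
>
>     # Invented example: 230 correct out of 256 two-choice trials.
>     print(round(drt_score(230, 26), 1))   # -> 79.7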
>
> On Nov 18, 2007 9:12 AM, Mike Lebo <[EMAIL PROTECTED]> wrote:
> >
> > Hi Vojtech,
> >
> > Thank you for your reply to my papers. I will do more work on the
> > phonemes. The project I want to do uses new computers that were not
> > available 10 years ago. Every 10 ms a decision is made to send a one
> > or a zero. To make that decision I have 68 parallel FFTs running in
> > the background. I believe the brain can handle mispronounced words
> > better than you think.
> >
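> > Roughly the kind of decision loop I mean, as a sketch (Python with
> > numpy; the sample rate, FFT length, and threshold rule here are
> > illustrative, not the final design):
> >
> >     import numpy as np
> >
> >     FS = 8000                  # assumed sample rate
> >     N = int(FS * 0.010)        # 80 samples = one 10 ms frame
> >     NFFT = 256                 # zero-padded FFT, 129 rfft bins
> >     NBINS = 68                 # parallel FFT bins being watched
> >
> >     def bit_decisions(samples, threshold=1.0):
> >         """Yield a 1 or 0 for each 10 ms frame by comparing the
> >         summed energy in the watched FFT bins to a threshold."""
> >         for k in range(len(samples) // N):
> >             frame = samples[k * N:(k + 1) * N] * np.hanning(N)
> >             spectrum = np.abs(np.fft.rfft(frame, n=NFFT))
> >             energy = spectrum[1:NBINS + 1].sum()   # skip DC
> >             yield 1 if energy > threshold else 0
> >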
> > Mike
> >
> >
> > On Nov 17, 2007 3:55 PM, r_lwesterfield <[EMAIL PROTECTED]> wrote:
> > >
> > >
> > > I have a few radios (ARC-210-1851, PSC-5D, PRC-117F) at work that
> > > use MELP (Mixed Excitation Linear Prediction) as the vocoder. We
> > > have found MELP to be superior to LPC-10 (more human-like voice
> > > qualities, less of Charlie Brown's teacher), but we use far larger
> > > bandwidths than 100 Hz. I do not know how well any of this will
> > > play out at such a narrow bandwidth. Listening to Charlie Brown's
> > > teacher will send you running away quickly, and you should think
> > > of your listeners . . . they will tire very quickly. Just because
> > > voice can be sent at such narrow bandwidths does not necessarily
> > > mean that people will like to listen to it.
> > >
> > > Rick – KH2DF
> > >
> > > ________________________________
> > >
> > > From: digitalradio@yahoogroups.com [mailto:digitalradio@yahoogroups.com]
> > > On Behalf Of Vojtech Bubník
> > > Sent: Saturday, November 17, 2007 9:11 AM
> > > To: [EMAIL PROTECTED]; digitalradio@yahoogroups.com
> > > Subject: [digitalradio] Re: digital voice within 100 Hz bandwidth
> > >
> > > Hi Mike.
> > >
> > > I studied some aspects of voice recognition about 10 years ago,
> > > when I thought of joining a research group at the Czech Technical
> > > University in Prague. I have a 260-page textbook on voice
> > > recognition on my bookshelf.
> > >
> > > A voice signal has high redundancy compared to a text
> > > transcription, but there is additional information stored in the
> > > voice signal, like pitch, intonation, and speed. One could, for
> > > example, estimate the mood of the speaker from the utterance.
> > >
> > > The vocal tract can be described by a generator (tone for vowels,
> > > hiss for consonants) and a filter. Translating voice into
> > > generator and filter coefficients greatly decreases the redundancy
> > > of the voice data. This is roughly the technique the common voice
> > > codecs use. GSM voice compression is a kind of Algebraic Code
> > > Excited Linear Prediction. Another interesting codec is AMBE
> > > (Advanced Multi-Band Excitation), used by the DSTAR system. The
> > > GSM half-rate codec squeezes voice to 5.6 kbit/s, AMBE to
> > > 3.6 kbit/s. Both systems use excitation tables, but AMBE is more
> > > efficient and closed source. I think the key to the efficiency is
> > > the size and quality of the excitation tables. Creating such an
> > > algorithm requires a considerable amount of research and data
> > > analysis. The intelligibility of the GSM and AMBE codecs is very
> > > good. You can buy the intellectual property of the AMBE codec by
> > > buying the chip. There are a couple of projects under way trying
> > > to build DSTAR into legacy transceivers.
> > >
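> > > To make the generator-plus-filter idea concrete, here is a minimal
> > > sketch of the analysis step of an LPC-style codec (Python with
> > > numpy; the frame handling and model order are illustrative, and a
> > > real codec would quantize these coefficients and code the
> > > excitation on top):
> > >
> > >     import numpy as np
> > >
> > >     def lpc_coefficients(frame, order=10):
> > >         """Fit an all-pole vocal tract filter to one speech frame
> > >         (autocorrelation method, Levinson-Durbin recursion)."""
> > >         frame = frame * np.hamming(len(frame))
> > >         # Autocorrelation at lags 0..order
> > >         full = np.correlate(frame, frame, mode='full')
> > >         r = full[len(frame) - 1:][:order + 1]
> > >
> > >         a = np.zeros(order + 1)   # A(z) = 1 + a1*z^-1 + ...
> > >         a[0] = 1.0
> > >         err = r[0]                # prediction error energy
> > >         for i in range(1, order + 1):
> > >             k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
> > >             a[1:i + 1] += k * a[i - 1::-1]   # reflect old coeffs
> > >             err *= (1.0 - k * k)
> > >         return a, err
> > >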
> > > About 10 years ago, we at the OK1KPI club experimented with an
> > > EchoLink-like system. We modified the Speak Freely software
> > > (http://www.speakfreely.org/) to control an FM transceiver, and we
> > > added a web interface to control the tuning and subtone of the
> > > transceiver. It was a lot of fun and a unique system at that time.
> > > The best compression factor is offered by the LPC-10 codec
> > > (3460 bit/s), but the sound is very robot-like and quite hard to
> > > understand. In the end we reverted to GSM. I think IVOX is a
> > > variant of the LPC system that we tried.
> > >
> > > Your proposal is to increase the compression rate by transmitting
> > > phonemes. I once had the same idea, but I quickly rejected it.
> > > Although it may be a nice exercise, I find it not very useful
> > > until good continuous-speech multi-speaker multi-language
> > > recognition systems are available. I will try to explain my
> > > reasoning behind that statement.
> > >
> > > Let's classify voice recognition systems by implementation
> > > complexity:
> > > 1) single-speaker, limited set of utterances recognized (control
> > > your desktop by voice)
> > > 2) multi-speaker, limited set of utterances recognized (automated
> > > phone system)
> > > 3) dictation system
> > > 4) continuous speech transcription
> > > 5) speech recognition and understanding
> > >
> > > Your proposal will need to implement most of the code from 4) or
> > > 5) to be really usable, and it has to be reliable.
> > >
> > > State-of-the-art voice recognition systems use hidden Markov
> > > models to detect phonemes. A phoneme is found by traversing a
> > > state diagram while evaluating a sequence of recorded spectra. The
> > > phoneme is soft-decoded: the output of the classifier is a list of
> > > phonemes with their detection probabilities assigned. To cope with
> > > phoneme smearing at the boundaries, either sub-phonemes or phoneme
> > > pairs need to be detected.
> > >
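> > > A minimal sketch of that soft decoding (Python with numpy; the
> > > two-state model, transition matrix, and per-frame observation
> > > likelihoods are invented for illustration, and only the forward
> > > pass is shown, where a real decoder would also run a backward
> > > pass):
> > >
> > >     import numpy as np
> > >
> > >     phonemes = ['a', 's']              # toy states: vowel, hiss
> > >     trans = np.array([[0.9, 0.1],      # P(next state | state)
> > >                       [0.2, 0.8]])
> > >     start = np.array([0.5, 0.5])
> > >
> > >     def forward_posteriors(obs_lik):
> > >         """Forward algorithm over per-frame observation
> > >         likelihoods; returns phoneme probabilities per frame."""
> > >         alpha = start * obs_lik[0]
> > >         out = [alpha / alpha.sum()]
> > >         for lik in obs_lik[1:]:
> > >             alpha = (out[-1] @ trans) * lik
> > >             out.append(alpha / alpha.sum())   # keep scaled
> > >         return np.array(out)
> > >
> > >     # Three frames: vowel-like, ambiguous, hiss-like.
> > >     frames = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
> > >     for probs in forward_posteriors(frames):
> > >         print(dict(zip(phonemes, np.round(probs, 2))))
> > >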
> > > After the phonemes are classified, they are chained into words,
> > > and the most probable words are picked depending on the
> > > dictionary. You suppose that your system will not need this. But
> > > the trouble is the consonants: they carry much less energy than
> > > vowels and are much easier to confuse. The dictionary is used to
> > > pick a second-highest-probability consonant in the word when the
> > > top candidate does not form a valid word. Not only the dictionary
> > > but also the phoneme classifier is language dependent.
> > >
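> > > A sketch of what the dictionary buys you (Python; the word list
> > > and the per-slot phoneme probabilities are invented):
> > >
> > >     import math
> > >
> > >     # Per-slot phoneme probabilities from a soft decoder. The
> > >     # consonants are weak and confusable; the vowel is clear.
> > >     slots = [
> > >         {'s': 0.40, 't': 0.35, 'f': 0.25},
> > >         {'o': 0.90, 'a': 0.10},
> > >         {'n': 0.55, 'm': 0.45},
> > >     ]
> > >     dictionary = ['ton', 'tom', 'fan', 'son']
> > >
> > >     def word_score(word):
> > >         """Log-probability of a word under the slot scores."""
> > >         if len(word) != len(slots):
> > >             return float('-inf')
> > >         return sum(math.log(slot.get(ph, 1e-6))
> > >                    for ph, slot in zip(word, slots))
> > >
> > >     # Greedy per-slot decoding says 's' 'o' 'n'; if 'son' were
> > >     # not in the dictionary, the second-best consonant would be
> > >     # recovered instead of outputting mumble.
> > >     print(max(dictionary, key=word_score))
> > >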
> > > I think the human brain works in the same way. Imagine learning a
> > > foreign language: even if you are able to recognize slowly
> > > pronounced words, you will be unable to pick them out of a quickly
> > > spoken sentence, because the words sound different. A human needs
> > > considerable training to understand a language. You could decrease
> > > the complexity of the decoder by constraining the detection to
> > > slowly dictated, separate words.
> > >
> > > If you simply pick the highest-probability phoneme, you will
> > > reproduce the comprehension problems of people with hearing loss.
> > > Oh yes, I currently work for a hearing instrument manufacturer (I
> > > have nothing to do with merck.com).
> > >
> > > From http://www.merck.com/mmhe/sec19/ch218/ch218a.html:
> > > > Loss of the ability to hear high-pitched sounds often makes it
> > > > more difficult to understand speech. Although the loudness of
> > > > speech appears normal to the person, certain consonant
> > > > sounds—such as the sound of letters C, D, K, P, S, and T—become
> > > > hard to distinguish, so that many people with hearing loss think
> > > > the speaker is mumbling. Words can be misinterpreted. For
> > > > example, a person may hear "bone" when the speaker said "stone."
> > >
> > > For me, it would be very irritating to dictate slowly to a
> > > system, knowing it will add some mumbling, without even having
> > > feedback about the errors the recognizer makes. From my
> > > perspective, until good voice recognition systems are available,
> > > it is reasonable to stick to the keyboard for extremely low bit
> > > rates. If you would like to experiment, there are a lot of open
> > > source voice recognition packages. I am sure you could hack one to
> > > output the most probable phoneme detected, and then try for
> > > yourself whether the result is intelligible or not. You do not
> > > need the sound generating system for that experiment; it is quite
> > > easy to read the written phonemes. After you have a good phoneme
> > > detector, the rest of your proposed software package is a piece of
> > > cake.
> > >
> > > I am afraid I will disappoint you, but I do not dismiss your
> > > work. I found a couple of nice ideas in your text. I like the idea
> > > of setting up the varicode table so that similar-sounding phonemes
> > > get neighboring codes, and of coding phoneme length by filling
> > > gaps in the data stream with a special code (see the sketch
> > > below). But I would propose that you read a textbook on voice
> > > recognition, so as not to reinvent the wheel.
> > >
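> > > A toy version of that varicode idea (Python; the phoneme set, the
> > > code assignments, and the similarity pairing are invented, and a
> > > real table would be built from measured phoneme frequencies and
> > > confusion data):
> > >
> > >     # Codewords contain no '00' inside and are separated by '00',
> > >     # like PSK31 varicode, so a decoder can resynchronize in a
> > >     # free-running bit stream. Frequent phonemes get short codes;
> > >     # confusable phonemes ('s'/'f') sit one bit flip apart.
> > >     VARICODE = {
> > >         'a':   '1',       # frequent vowel: shortest code
> > >         'e':   '11',
> > >         's':   '101',     # 's' and 'f' are easily confused,
> > >         'f':   '111',     # so a single bit error swaps them
> > >         'GAP': '1011',    # special code: stretches the previous
> > >     }                     # phoneme to encode its length
> > >
> > >     def encode(phonemes):
> > >         return '00'.join(VARICODE[p] for p in phonemes)
> > >
> > >     def decode(bits):
> > >         inverse = {v: k for k, v in VARICODE.items()}
> > >         return [inverse.get(sym, '?') for sym in bits.split('00')]
> > >
> > >     stream = encode(['s', 'a', 'GAP', 'f'])
> > >     print(stream, '->', decode(stream))
> > >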
> > > 73 and GL, Vojtech OK1IAK
> > >
> > >
> >
> >
>
> --
>
> Regards, Robert Thompson
>
> ====================================================
> ~ Concise, Complete, Correct: Pick Two
> ~ Faster, Cheaper, Better: Pick Two
> ~ Pervasive, Powerful, Trustworthy: Pick One
> ~ "Whom the computers would destroy, they first drive mad."
> ~ -- Anonymous
> ====================================================
>
>
