There are several (military/gov) standard intelligibility tests that
do a pretty good job of scoring what most humans can and cannot
reliably understand. You might try taking a look at them to get some
ideas of which voice characteristics make the most difference to
intelligibility. There is actually a surprising amount of data out
there, especially if you include the data peripheral to the various
computerized speech translator research projects. It's not *exactly*
signal processing... but understanding what parts of the signal matter
the most can be surprisingly helpful. This may be unusually
productive, because as of yet there hasn't been a huge amount of
cross-discipline work between the codec researchers and the
speech-to-meaning researchers. While there's a lot of duplicate
research in there, it tends to be from slightly different
perspectives, and the "stereo view" can sometimes help.



On Nov 18, 2007 9:12 AM, Mike Lebo <[EMAIL PROTECTED]> wrote:
>
>  Hi Vojtech,
>
> Thank you for your reply to my papers. I will do more work on the
> phonemes. The project I want to do uses new computers that were not
> available 10 years ago. Every 10 ms a decision is made to send a one
> or a zero. To make that decision I have 68 parallel FFTs running in
> the background. I believe the brain can handle mispronounced words
> better than you think.
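>
> Roughly, each 10 ms decision looks like the sketch below. This is a
> simplification: it uses a single FFT per decision instead of my 68
> staggered ones, and the sample rate, FFT size, and tone pair are just
> placeholder numbers, not the real design values.
>
>   import numpy as np
>
>   FS = 8000              # sample rate in Hz (placeholder)
>   DECISION = FS // 100   # 10 ms of samples per decision
>   NFFT = 2048            # analysis window length (placeholder)
>   F_MARK, F_SPACE = 1000.0, 1050.0   # placeholder tone pair, Hz
>
>   def decide_bits(samples):
>       """Emit a 1 or a 0 every 10 ms by comparing two tone energies."""
>       k_mark = round(F_MARK * NFFT / FS)    # FFT bin of the mark tone
>       k_space = round(F_SPACE * NFFT / FS)  # FFT bin of the space tone
>       window = np.hanning(NFFT)
>       bits = []
>       for start in range(0, len(samples) - NFFT + 1, DECISION):
>           spectrum = np.fft.rfft(samples[start:start + NFFT] * window)
>           bits.append(1 if abs(spectrum[k_mark]) > abs(spectrum[k_space]) else 0)
>       return bits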
>
> Mike
>
>
> On Nov 17, 2007 3:55 PM, r_lwesterfield <[EMAIL PROTECTED]>
> wrote:
> >
> > I have a few radios at work (ARC-210-1851, PSC-5D, PRC-117F) that run
> > MELP (Mixed Excitation Linear Prediction) as the vocoder. We have
> > found MELP to be superior to LPC-10 (more human-like voice quality,
> > less of Charlie Brown's teacher), but we use far larger bandwidths
> > than 100 Hz. I do not know how well any of this will play out at such
> > a narrow bandwidth. Listening to Charlie Brown's teacher will send you
> > running away quickly, and you should think of your listeners . . .
> > they will tire very quickly. Just because voice can be sent at such
> > narrow bandwidths does not necessarily mean that people will like
> > listening to it.
> >
> >
> >
> > Rick – KH2DF
> >
> >
> >
> > ________________________________
>
> >
> > From: digitalradio@yahoogroups.com [mailto:[EMAIL PROTECTED]
> On Behalf Of Vojtech Bubník
> > Sent: Saturday, November 17, 2007 9:11 AM
> > To: [EMAIL PROTECTED]; digitalradio@yahoogroups.com
> > Subject: [digitalradio] Re: digital voice within 100 Hz bandwidth
> >
> > Hi Mike.
> >
> > I studied some aspects of voice recognition about 10 years ago, when
> > I thought about joining a research group at the Czech Technical
> > University in Prague. I have a 260-page textbook on voice recognition
> > on my bookshelf.
> >
> > A voice signal is highly redundant compared to a text transcription,
> > but it also carries additional information such as pitch, intonation,
> > and speed. One can, for example, estimate the mood of the speaker from
> > an utterance.
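> >
> > Some rough numbers show the scale of that redundancy: telephone-quality
> > speech sampled at 8 kHz with 8 bits per sample is 64 kbit/s, while a
> > text transcription of normal speech (about 150 words per minute, 8-bit
> > characters) needs only on the order of 100-200 bit/s. The raw signal
> > thus carries several hundred times more bits than the words alone.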
> >
> > The vocal tract can be described by a generator (a tone for vowels, a
> > hiss for consonants) and a filter. Translating voice into generator
> > and filter coefficients greatly decreases its redundancy, and this is
> > roughly what the common voice codecs do. GSM voice compression is a
> > kind of Algebraic Code Excited Linear Prediction. Another interesting
> > codec is AMBE (Advanced Multi-Band Excitation), used by the DSTAR
> > system. The GSM half-rate codec squeezes voice down to 5.6 kbit/s,
> > AMBE to 3.6 kbit/s. Both systems use excitation tables, but AMBE is
> > more efficient and closed source. I think the key to the efficiency
> > is the size and quality of the excitation tables; creating such an
> > algorithm requires a considerable amount of research and data
> > analysis. The intelligibility of the GSM and AMBE codecs is very
> > good. You can buy the intellectual property of the AMBE codec by
> > buying the chip, and there are a couple of projects trying to build
> > DSTAR into legacy transceivers.
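> >
> > To make the generator-and-filter picture concrete, here is a toy LPC
> > analysis/synthesis in Python. It is nothing like GSM or AMBE (no
> > excitation tables, just a pulse train or noise as the generator), and
> > the filter order and pitch period are arbitrary illustration values.
> >
> >   import numpy as np
> >   from scipy.signal import lfilter
> >
> >   def lpc(frame, order=10):
> >       """Fit filter coefficients with the Levinson-Durbin recursion."""
> >       r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
> >       a = np.zeros(order + 1)
> >       a[0] = 1.0
> >       err = r[0]
> >       for i in range(1, order + 1):
> >           k = -np.dot(a[:i], r[i:0:-1]) / err   # reflection coefficient
> >           a[:i + 1] += k * a[:i + 1][::-1]
> >           err *= 1.0 - k * k
> >       return a, err
> >
> >   def synthesize(a, gain, n, voiced, pitch=80):
> >       """Drive the filter with a tone (pulse train) or a hiss (noise)."""
> >       if voiced:
> >           excitation = np.zeros(n)
> >           excitation[::pitch] = 1.0   # pulse train at the pitch period
> >       else:
> >           excitation = np.random.randn(n)
> >       return lfilter([np.sqrt(gain)], a, excitation)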
> >
> > About 10 years ago, we at the OK1KPI club experimented with an
> > EchoLink-like system. We modified the speakfreely software to control
> > an FM transceiver and added a web interface for controlling the
> > transceiver's tuning and subtone. It was a lot of fun and a very
> > unique system at the time. http://www.speakfreely.org/ The best
> > compression factor is offered by the LPC-10 codec (3460 bit/s), but
> > the sound is very robot-like and quite hard to understand, so in the
> > end we reverted to GSM. I think IVOX is a variant of the LPC system
> > that we tried.
> >
> > Your proposal is to increase the compression rate by transmitting
> > phonemes. I once had the same idea, but I quickly rejected it.
> > Although it may be a nice exercise, I find it not very useful until
> > good continuous-speech, multi-speaker, multi-language recognition
> > systems are available. I will try to explain my reasoning behind that
> > statement.
> >
> > Let's classify voice recognition systems by implementation complexity:
> > 1) Single-speaker, limited set of utterances recognized (control your
> > desktop by voice)
> > 2) Multi-speaker, limited set of utterances recognized (automated
> > phone system)
> > 3) Dictation system
> > 4) Continuous speech transcription
> > 5) Speech recognition and understanding
> >
> > Your proposal will need to implement most of the code from 4) or 5)
> > to be really usable, and it has to be reliable.
> >
> > State-of-the-art voice recognition systems use hidden Markov models
> > to detect phonemes. A phoneme is found by traversing a state diagram
> > while evaluating a sequence of recorded spectra. The phoneme is
> > soft-decoded: the output of the classifier is a list of phonemes with
> > their detection probabilities attached. To cope with the smearing of
> > phonemes at their boundaries, either sub-phonemes or phoneme pairs
> > need to be detected.
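> >
> > As a toy illustration of the soft decision, here is HMM forward
> > filtering over two made-up phoneme states and three coarse spectral
> > classes. All the probabilities are invented; a real classifier has
> > dozens of states and far richer acoustic features.
> >
> >   import numpy as np
> >
> >   states = ["ah", "s"]               # two made-up phoneme states
> >   trans = np.array([[0.9, 0.1],      # P(next state | current state)
> >                     [0.2, 0.8]])
> >   emit = np.array([[0.7, 0.2, 0.1],  # P(spectral class | state)
> >                    [0.1, 0.2, 0.7]])
> >
> >   def soft_decode(observations, prior=(0.5, 0.5)):
> >       """Return, per frame, every phoneme with its probability."""
> >       belief = np.array(prior)
> >       results = []
> >       for obs in observations:
> >           belief = (trans.T @ belief) * emit[:, obs]  # predict, update
> >           belief /= belief.sum()                      # renormalize
> >           results.append(sorted(zip(states, belief), key=lambda p: -p[1]))
> >       return results
> >
> >   print(soft_decode([0, 0, 2, 2]))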
> >
> > After the phonemes are classified, they are chained into words, and
> > the most probable words are picked depending on the dictionary. You
> > suppose that your system will not need this step, but the trouble is
> > the consonants: they carry much less energy than vowels and are much
> > easier to confuse. The dictionary is what lets the decoder pick a
> > second-highest-probability consonant in a word. Not only the
> > dictionary, but also the phoneme classifier is language dependent.
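> >
> > A tiny example of how the dictionary rescues a weak consonant; the
> > candidate lists, probabilities, and three-word dictionary are all
> > invented for illustration:
> >
> >   candidates = [             # per-position output of a soft classifier
> >       {"b": 0.5, "p": 0.4},  # weak consonant, easily confused
> >       {"ih": 0.9, "eh": 0.1},
> >       {"t": 0.6, "d": 0.35},
> >   ]
> >   dictionary = [("b", "ih", "t"), ("p", "ih", "t"), ("p", "eh", "t")]
> >
> >   def best_word(candidates, dictionary):
> >       def score(word):
> >           p = 1.0
> >           for position, phoneme in zip(candidates, word):
> >               p *= position.get(phoneme, 0.0)
> >           return p
> >       return max(dictionary, key=score)
> >
> >   print(best_word(candidates, dictionary))  # ('b', 'ih', 't')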
> >
> > I think the human brain works in the same way. Imagine learning a
> > foreign language: even if you are able to recognize slowly pronounced
> > words, you will be unable to pick them out of a quickly spoken
> > sentence, because the words will sound different. A human needs
> > considerable training to understand a language. You could decrease
> > the complexity of the decoder by constraining the detection to slowly
> > dictated, separate words.
> >
> > If you simply pick the highest-probability phoneme, you will
> > experience the comprehension problems of people with hearing loss. Oh
> > yes, I currently work for a hearing instrument manufacturer (I have
> > nothing to do with merck.com).
> >
> > from http://www.merck.com/mmhe/sec19/ch218/ch218a.html
> > > Loss of the ability to hear high-pitched sounds often makes it more
> difficult to understand speech. Although the loudness of speech appears
> normal to the person, certain consonant sounds—such as the sound of letters
> C, D, K, P, S, and T—become hard to distinguish, so that many people with
> hearing loss think the speaker is mumbling. Words can be misinterpreted. For
> example, a person may hear "bone" when the speaker said "stone."
> >
> > For me, it would be very irritating to dictate slowly to a system,
> > knowing that it will add some mumbling, without even getting feedback
> > about the errors the recognizer makes. From my perspective, until
> > good voice recognition systems exist, it is reasonable to stick to
> > the keyboard for extremely low bit rates. If you would like to
> > experiment, there are lots of open-source voice recognition packages.
> > I am sure you could hack one to output the most probable phonemes
> > detected and then try for yourself whether the result is
> > intelligible. You do not need the sound-generating system for that
> > experiment; it is quite easy to read written phonemes. Once you have
> > a good phoneme detector, the rest of your proposed software package
> > is a piece of cake.
> >
> > I am afraid I will disappoint you, but I do not mean to disparage
> > your work. I found a couple of nice ideas in your text: I like the
> > idea of setting up the varicode table so that similarly sounding
> > phonemes get neighboring codes, and of coding phoneme length by
> > filling the gaps in the data stream with a special code. But I would
> > advise you to read a textbook on voice recognition, so as not to
> > reinvent the wheel.
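> >
> > To show what I mean, here is a toy rendering of those two ideas (the
> > code table itself is invented): confusable voiceless/voiced pairs sit
> > on neighboring codes, so a small code error still decodes to a
> > similar-sounding phoneme, and a dedicated HOLD code repeated in the
> > gaps carries the phoneme length.
> >
> >   HOLD = 0
> >   TABLE = {1: "p", 2: "b",   # neighbours: voiceless/voiced pairs
> >            3: "t", 4: "d",
> >            5: "s", 6: "z"}
> >   INVERSE = {v: k for k, v in TABLE.items()}
> >
> >   def encode(phonemes):
> >       """phonemes: list of (symbol, duration in frames)."""
> >       codes = []
> >       for symbol, duration in phonemes:
> >           codes.append(INVERSE[symbol])
> >           codes.extend([HOLD] * (duration - 1))  # fill the gap
> >       return codes
> >
> >   def decode(codes):
> >       """Assumes a well-formed stream (never starts with HOLD)."""
> >       out = []
> >       for c in codes:
> >           if c == HOLD:   # HOLD extends the previous phoneme
> >               out[-1] = (out[-1][0], out[-1][1] + 1)
> >           else:
> >               out.append((TABLE[c], 1))
> >       return out
> >
> >   print(decode(encode([("p", 3), ("s", 1), ("b", 2)])))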
> >
> > 73 and GL, Vojtech OK1IAK
> >
> >
>
>  



-- 

Regards, Robert Thompson

====================================================
~   Concise, Complete, Correct: Pick Two
~   Faster, Cheaper, Better: Pick Two
~   Pervasive, Powerful, Trustworthy: Pick One
~        "Whom the computers would destroy, they first drive mad."
~                            -- Anonymous
====================================================
