Hi Vojtech,

Thank you for your reply to my papers. I will do more work on the phonemes. The project I want to do uses new computers that were not available 10 years ago. Every 10 ms a decision is made to send a one or a zero. To make that decision I have 68 parallel FFTs running in the background. I believe the brain can handle mispronounced words better than you think.
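In rough terms, the decision could look like the sketch below (Python with NumPy). This is only one plausible reading of the design – the sample rate, FFT length, and the two tone frequencies are placeholder assumptions, and "68 parallel FFTs" is interpreted here as 68 time-staggered windows voting on each bit:

    # Sketch of a 10 ms mark/space decision backed by a bank of 68
    # staggered FFTs. Illustrative only: the sample rate, FFT length,
    # and tone frequencies are placeholders, not actual design values.
    import numpy as np

    FS = 8000                       # sample rate in Hz (assumed)
    NFFT = 1024                     # FFT length (assumed)
    NBANK = 68                      # number of staggered FFTs
    F_MARK, F_SPACE = 800.0, 820.0  # "one"/"zero" tones in Hz (placeholders)

    K_MARK = int(round(F_MARK * NFFT / FS))    # FFT bin of the mark tone
    K_SPACE = int(round(F_SPACE * NFFT / FS))  # FFT bin of the space tone
    WIN = np.hanning(NFFT)

    def decide_bit(history: np.ndarray) -> int:
        """Decide one bit from the last NFFT + NBANK - 1 samples.

        Each of the NBANK FFTs sees the same NFFT-long window shifted
        by one sample; their energies at the two tone bins are summed
        and the larger total wins.
        """
        mark_e = space_e = 0.0
        for i in range(NBANK):
            start = len(history) - NFFT - i
            spec = np.fft.rfft(history[start:start + NFFT] * WIN)
            mark_e += abs(spec[K_MARK]) ** 2
            space_e += abs(spec[K_SPACE]) ** 2
        return 1 if mark_e > space_e else 0

    # Called once per 10 ms block, i.e. every FS // 100 = 80 samples.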
Mike

On Nov 17, 2007 3:55 PM, r_lwesterfield <[EMAIL PROTECTED]> wrote:

> I have a few radios (ARC-210-1851, PSC-5D, PRC-117F) at work that operate with MELP as a vocoder – Mixed Excitation Linear Prediction. We have found MELP to be superior to LPC-10 (more human-like voice qualities – less Charlie Brown's teacher), but we use far larger bandwidths than 100 Hz. I do not know how well any of this will play out at such a narrow bandwidth. Listening to Charlie Brown's teacher will send you running away quickly, and you should think of your listeners . . . they will tire very quickly. Just because voice can be sent at such narrow bandwidths does not necessarily mean that people will like listening to it.
>
> Rick – KH2DF
>
> ------------------------------
>
> From: digitalradio@yahoogroups.com [mailto:[EMAIL PROTECTED]] On Behalf Of Vojtech Bubník
> Sent: Saturday, November 17, 2007 9:11 AM
> To: [EMAIL PROTECTED]; digitalradio@yahoogroups.com
> Subject: [digitalradio] Re: digital voice within 100 Hz bandwidth
>
> Hi Mike.
>
> I studied some aspects of voice recognition about 10 years ago, when I thought of joining a research group at the Czech Technical University in Prague. I have a 260-page textbook on voice recognition on my bookshelf.
>
> A voice signal has high redundancy compared to a text transcription. But there is additional information stored in the voice signal, like pitch, intonation, and speed. One could, for example, estimate the mood of the speaker from the utterance.
>
> The vocal tract can be described by a generator (tone for vowels, hiss for consonants) and a filter. Translating voice into generator and filter coefficients greatly decreases the redundancy of the voice data. This is roughly the technique that common voice codecs use. GSM voice compression is a kind of Algebraic Code Excited Linear Prediction. Another interesting codec is AMBE (Advanced Multi-Band Excitation), used by the D-STAR system. The GSM half-rate codec squeezes voice to 5.6 kbit/s, AMBE to 3.6 kbit/s. Both systems use excitation tables, but AMBE is more efficient and closed source. I think the key to the efficiency is in the size and quality of the excitation tables. Creating such an algorithm requires a considerable amount of research and data analysis. The intelligibility of the GSM and AMBE codecs is very good. You can buy the intellectual property of the AMBE codec by buying the chip. There are a couple of projects trying to build D-STAR into legacy transceivers.
>
> About 10 years ago, we at the OK1KPI club experimented with an EchoLink-like system. We modified the Speak Freely software to control an FM transceiver, and we added a web interface to control the tuning and subtone of the transceiver. It was a lot of fun and a unique system at that time (http://www.speakfreely.org/). The best compression factor is offered by the LPC-10 codec (3460 bps), but the sound is very robot-like and quite hard to understand. In the end we reverted to GSM. I think IVOX is a variant of the LPC system that we tried.
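> To make the generator-and-filter idea concrete, here is a minimal sketch of the analysis half in Python with NumPy. The frame size, filter order, and the autocorrelation/Levinson-Durbin method are illustrative textbook choices, not the internals of any of the codecs above:
>
>     # Minimal LPC analysis sketch: fit an all-pole "vocal tract" filter to
>     # one 20 ms frame via autocorrelation + Levinson-Durbin recursion.
>     # Real codecs add quantization, pitch detection and excitation coding.
>     import numpy as np
>
>     def lpc_coefficients(frame: np.ndarray, order: int = 10) -> np.ndarray:
>         """Return the prediction polynomial a[0..order], with a[0] == 1."""
>         x = frame * np.hamming(len(frame))
>         r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation
>         a = np.zeros(order + 1)
>         a[0], err = 1.0, r[0]
>         for i in range(1, order + 1):
>             acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
>             k = -acc / err                       # reflection coefficient
>             new_a = a.copy()
>             for j in range(1, i):
>                 new_a[j] = a[j] + k * a[i - j]
>             new_a[i] = k
>             a, err = new_a, err * (1.0 - k * k)
>         return a
>
>     # Roughly 10 coefficients per 20 ms frame stand in for 160 raw samples
>     # at 8 kHz; the encoder then only describes the excitation (generator).
>     frame = np.random.randn(160)             # stand-in for one speech frame
>     a = lpc_coefficients(frame)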
> Your proposal is to increase the compression rate by transmitting phonemes. I once had the same idea, but I quickly rejected it. Although it may be a nice exercise, I find it not very useful until good continuous-speech, multi-speaker, multi-language recognition systems are available. I will try to explain my reasoning behind that statement.
>
> Let's classify voice recognition systems by implementation complexity:
>
> 1) single-speaker, limited set of utterances recognized (control your desktop by voice)
> 2) multi-speaker, limited set of utterances recognized (automated phone system)
> 3) dictation system
> 4) continuous speech transcription
> 5) speech recognition and understanding
>
> Your proposal will need to implement most of the code from 4) or 5) to be really usable, and it has to be reliable.
>
> State-of-the-art voice recognition systems use hidden Markov models to detect phonemes. A phoneme is found by traversing a state diagram while evaluating multiple recorded spectra. The phoneme is soft-decoded: the output of the classifier is a list of phonemes with their detection probabilities attached. To cope with the smearing of phonemes at their boundaries, either sub-phonemes or phoneme pairs need to be detected.
>
> After the phonemes are classified, they are chained into words, and the most probable words are picked depending on the dictionary. You suppose that your system will not need this. But the trouble is the consonants. They carry much less energy than vowels and are much more easily confused. The dictionary is used to pick, within a word, some consonants that were only the second-highest-probability detections. Not only the dictionary but also the phoneme classifier is language dependent.
>
> I think the human brain works in the same way. Imagine learning a foreign language. Even if you are able to recognize slowly pronounced words, you will be unable to pick them out of a quickly pronounced sentence. The words will sound different. A human needs considerable training to understand a language. You could decrease the complexity of the decoder by constraining the detection to slowly dictated, separate words.
>
> If you simply pick the highest-probability phoneme, you will experience the comprehension problems of people with hearing loss. Oh yes, I am currently working for a hearing instrument manufacturer (I have nothing to do with merck.com).
>
> From http://www.merck.com/mmhe/sec19/ch218/ch218a.html:
>
> Loss of the ability to hear high-pitched sounds often makes it more difficult to understand speech. Although the loudness of speech appears normal to the person, certain consonant sounds – such as the sound of the letters C, D, K, P, S, and T – become hard to distinguish, so that many people with hearing loss think the speaker is mumbling. Words can be misinterpreted. For example, a person may hear "bone" when the speaker said "stone."
>
> For me, it would be very irritating to dictate slowly to a system, knowing it will add some mumbling, and without even having feedback about the errors the recognizer makes. From my perspective, until good voice recognition systems are available, it is reasonable to stick to the keyboard for extremely low bit rates. If you would like to experiment, there are lots of open-source voice recognition packages. I am sure you could hack one to output the most probable phoneme detected, and you may try for yourself whether the result is intelligible or not. You do not need the sound-generating system for that experiment; it is quite easy to read the written phonemes. Once you have a good phoneme detector, the rest of your proposed software package is a piece of cake.
>
> I am afraid I will disappoint you. I do not mean to condemn your work; I found a couple of nice ideas in your text. I like the idea of setting up the varicode table to code similar-sounding phonemes with neighboring codes, and of coding phoneme length by filling gaps in the data stream with a special code.
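> A tiny sketch of what such a table could look like (the phoneme set, the codewords, and the gap-filler code below are invented for illustration; a real table would be derived from confusion statistics):
>
>     # Illustrative varicode fragment: similar-sounding phonemes get
>     # codewords that differ in a single bit, so a bit error tends to
>     # substitute a similar phoneme rather than a random one. The
>     # phoneme set and code assignments are invented for this example.
>     VARICODE = {
>         "AH": "10",       # frequent vowels get the shortest codes
>         "IY": "110",
>         "S":  "11100",    # S and F are easily confused unvoiced
>         "F":  "11101",    #   consonants, so they sit one bit apart
>         "TH": "11110",
>         "HOLD": "0",      # gap filler: "previous phoneme continues"
>     }
>
>     def encode(phonemes):
>         """Encode a phoneme stream; repeats become HOLD fillers."""
>         bits, prev = [], None
>         for p in phonemes:
>             bits.append(VARICODE["HOLD"] if p == prev else VARICODE[p])
>             prev = p
>         return "".join(bits)
>
>     # A sustained "S" costs one extra bit per frame instead of five:
>     print(encode(["S", "S", "S", "AH"]))   # -> "111000010" (11100|0|0|10)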
> But I would suggest that you read a textbook on voice recognition, so as not to reinvent the wheel.
>
> 73 and GL, Vojtech OK1IAK