Oops, sent too quickly. What I meant was: "That (speech recognition not being real time) is not entirely true." There are many commercial packages that do minimal-lag "realtime" speech recognition. One example would be the voice command features built into Apple's OS X. Another would be any one of a number of speech-to-text transcription packages.

I apologize if my unsupported and abrupt original phrasing appeared to be inflammatory. Such was not intended.

On Nov 18, 2007 2:11 PM, Robert Thompson <[EMAIL PROTECTED]> wrote:
> That is not entirely true. Besides, I wasn't focusing so much on their
> "real" research as the voice characterization research that they had
> to do before they could usefully work on recognition. It turns out
> that the very areas that are most necessary for digital voice
> recognition are the ones most necessary for human brains to recognize
> and interpret. Voice is a mixed-information-density signal, and if you
> "simplify" the signal by filtering out and discarding the less
> necessary elements, you have significantly reduced the effort the next
> stage has to do, whether it's digital encoding or speech recognition.
>
> On Nov 18, 2007 1:31 PM, Mike Lebo <[EMAIL PROTECTED]> wrote:
> > Robert,
> >
> > I agree. The thing that is different is that speech recognition is
> > not real time. Voice over the radio is real time.
> >
> > Mike n6ief
> >
> > On Nov 18, 2007 10:46 AM, Robert Thompson <[EMAIL PROTECTED]> wrote:
> > > There are several (military/gov) standard intelligibility tests
> > > that do a pretty good job of scoring what most humans can and can
> > > not reliably understand. You might try taking a look at them to
> > > get some ideas of which voice characteristics make the most
> > > difference to intelligibility. There is actually a surprising
> > > amount of data out there, especially if you include the data
> > > peripheral to the various computerized speech translator research
> > > projects. It's not *exactly* signal processing... but
> > > understanding what parts of the signal matter the most can be
> > > surprisingly helpful.
> > > This may be unusually productive, because as of yet there hasn't
> > > been a huge amount of cross-discipline work between the codec
> > > researchers and the speech-to-meaning researchers. While there's a
> > > lot of duplicate research in there, it tends to be from slightly
> > > different perspectives, and the "stereo view" can sometimes help.
> > >
> > > On Nov 18, 2007 9:12 AM, Mike Lebo <[EMAIL PROTECTED]> wrote:
> > > > Hi Vojtech,
> > > >
> > > > Thank you for your reply to my papers. I will do more work on
> > > > the phonemes. The project I want to do uses new computers that
> > > > were not available 10 years ago. Every 10 ms a decision is made
> > > > to send a one or a zero. To make that decision I have 68
> > > > parallel FFTs running in the background. I believe the brain
> > > > could handle mispronounced words better than you think.
> > > >
> > > > Mike
> > > >
> > > > On Nov 17, 2007 3:55 PM, r_lwesterfield <[EMAIL PROTECTED]> wrote:
> > > > > I have a few radios (ARC-210-1851, PSC-5D, PRC-117F) at work
> > > > > that operate in MELP for a vocoder – Mixed Excitation Linear
> > > > > Prediction. We have found MELP to be superior (more human-like
> > > > > voice qualities – less Charlie Brown's teacher) to LPC-10, but
> > > > > we use far larger bandwidths than 100 kHz. I do not know how
> > > > > well any of this will play out at such a narrow bandwidth.
> > > > > Listening to Charlie Brown's teacher will send you running
> > > > > away quickly, and you should think of your listeners . . .
> > > > > they will tire very quickly. Just because voice can be sent at
> > > > > such narrow bandwidths does not necessarily mean that people
> > > > > will like to listen to it.
> > > > >
> > > > > Rick – KH2DF
> > > > >
> > > > > ________________________________
> > > > >
> > > > > From: digitalradio@yahoogroups.com [mailto:[EMAIL PROTECTED] On Behalf Of Vojtech Bubník
> > > > > Sent: Saturday, November 17, 2007 9:11 AM
> > > > > To: [EMAIL PROTECTED]; digitalradio@yahoogroups.com
> > > > > Subject: [digitalradio] Re: digital voice within 100 Hz bandwidth
> > > > >
> > > > > Hi Mike.
> > > > >
> > > > > I studied some aspects of voice recognition about 10 years ago
> > > > > when I thought of joining a research group at Czech Technical
> > > > > University in Prague. I have a 260-page textbook on voice
> > > > > recognition on my bookshelf.
> > > > >
> > > > > A voice signal has high redundancy compared to a text
> > > > > transcription. But there is additional information stored in
> > > > > the voice signal, like pitch, intonation, and speed. One could
> > > > > estimate, for example, the mood of the speaker from the
> > > > > utterance.
> > > > >
> > > > > The vocal tract can be described by a generator (tone for
> > > > > vowels, hiss for consonants) and a filter. Translating voice
> > > > > into generator and filter coefficients greatly decreases voice
> > > > > data redundancy. This is roughly the technique that the common
> > > > > voice codecs use. GSM voice compression is a kind of Algebraic
> > > > > Code Excited Linear Prediction. Another interesting codec is
> > > > > AMBE (Advanced Multi-Band Excitation), used by the DSTAR
> > > > > system. The GSM half-rate codec squeezes voice to 5.6 kbit/s,
> > > > > AMBE to 3.6 kbps. Both systems use excitation tables, but AMBE
> > > > > is more efficient and closed source. I think the key to the
> > > > > efficiency is in the size and quality of the excitation
> > > > > tables. Creating such an algorithm requires a considerable
> > > > > amount of research and data analysis.
> > > > > The intelligibility of GSM or AMBE codecs is very good. You
> > > > > could buy the intellectual property of the AMBE codec by
> > > > > buying the chip. There are a couple of projects running that
> > > > > are trying to build DSTAR into legacy transceivers.
> > > > >
> > > > > About 10 years ago we at the OK1KPI club experimented with an
> > > > > Echolink-like system. We modified the speakfreely software to
> > > > > control an FM transceiver, and we added a web interface to
> > > > > control tuning and subtone of the transceiver. It was a lot of
> > > > > fun and a very unique system at that time.
> > > > > http://www.speakfreely.org/ The LPC-10 codec offers the best
> > > > > compression factor (3460kbps), but the sound is very
> > > > > robot-like and quite hard to understand. In the end we
> > > > > reverted to GSM. I think IVOX is a variant of the LPC system
> > > > > that we tried.
> > > > >
> > > > > Your proposal is to increase the compression rate by
> > > > > transmitting phonemes. I once had the same idea, but I quickly
> > > > > rejected it. Although it may be a nice exercise, I find it not
> > > > > very useful until good continuous-speech, multi-speaker,
> > > > > multi-language recognition systems are available. I will try
> > > > > to explain my reasoning behind that statement.
> > > > >
> > > > > Let's classify voice recognition systems by implementation
> > > > > complexity:
> > > > > 1) Single-speaker, limited set of utterances recognized
> > > > >    (control your desktop by voice)
> > > > > 2) Multiple-speaker, limited set of utterances recognized
> > > > >    (automated phone system)
> > > > > 3) dictating system
> > > > > 4) continuous speech transcription
> > > > > 5) speech recognition and understanding
> > > > >
> > > > > Your proposal will need to implement most of the code from 4)
> > > > > or 5) to be really usable, and it has to be reliable.
> > > > >
> > > > > State-of-the-art voice recognition systems use hidden Markov
> > > > > models to detect phonemes.
> > > > > A phoneme is found by traversing a state diagram while
> > > > > evaluating multiple recorded spectra. The phoneme is
> > > > > soft-decoded: the output of the classifier is a list of
> > > > > phonemes with their probabilities of detection assigned. To
> > > > > cope with phoneme smearing at their boundaries, either
> > > > > sub-phonemes or phoneme pairs need to be detected.
> > > > >
> > > > > After the phonemes are classified, they are chained into
> > > > > words. Depending on the dictionary, the most probable words
> > > > > are picked. You suppose that your system will not need it. But
> > > > > the trouble is consonants. They carry much less energy than
> > > > > vowels and are much more easily confused. The dictionary is
> > > > > used to pick some second-highest-probability detected
> > > > > consonants in the word. Not only the dictionary, but also the
> > > > > phoneme classifier is language dependent.
> > > > >
> > > > > I think the human brain works in the same way. Imagine
> > > > > learning a foreign language. Even if you are able to recognize
> > > > > slowly pronounced words, you will be unable to pick them out
> > > > > in a quickly spoken sentence. The word will sound different. A
> > > > > human needs considerable training to understand a language.
> > > > > You could decrease the complexity of the decoder by
> > > > > constraining the detection to slowly dictated separate words.
> > > > >
> > > > > If you simply pick the highest-probability phoneme, you will
> > > > > experience the comprehension problems of people with hearing
> > > > > loss. Oh yes, I am currently working for a hearing instrument
> > > > > manufacturer (I have nothing to do with merck.com).
> > > > >
> > > > > from http://www.merck.com/mmhe/sec19/ch218/ch218a.html
> > > > > > Loss of the ability to hear high-pitched sounds often makes
> > > > > > it more difficult to understand speech.
> > > > > > Although the loudness of speech appears normal to the
> > > > > > person, certain consonant sounds (such as the sound of
> > > > > > letters C, D, K, P, S, and T) become hard to distinguish, so
> > > > > > that many people with hearing loss think the speaker is
> > > > > > mumbling. Words can be misinterpreted. For example, a person
> > > > > > may hear "bone" when the speaker said "stone."
> > > > >
> > > > > For me, it would be very irritating to dictate slowly to a
> > > > > system knowing it will add some mumbling, without even having
> > > > > feedback about the errors the recognizer makes. From my
> > > > > perspective, until good voice recognition systems are
> > > > > available, it is reasonable to stick to the keyboard for
> > > > > extremely low bit rates. If you would like to experiment,
> > > > > there are a lot of open source voice recognition packages. I
> > > > > am sure you could hack one to output the most probable phoneme
> > > > > detected, and you may try for yourself whether the result will
> > > > > be intelligible or not. You do not need the sound generating
> > > > > system for that experiment; it is quite easy to read the
> > > > > written phonemes. After you have a good phoneme detector, the
> > > > > rest of your proposed software package is a piece of cake.
> > > > >
> > > > > I am afraid I will disappoint you. I do not contemn your work.
> > > > > I found a couple of nice ideas in your text. I like the idea
> > > > > of setting up the varicode table to code similar-sounding
> > > > > phonemes by neighboring codes, and of coding phoneme length by
> > > > > filling gaps in the data stream with a special code. But I
> > > > > would suggest you read a textbook on voice recognition so as
> > > > > not to reinvent the wheel.
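
To make the soft-decoding point in the quoted message concrete, here is a minimal Python sketch. It assumes a classifier that emits, for each phoneme position, a table of candidate phonemes with probabilities. The phoneme symbols, the probabilities, the tiny lexicon, and the function names greedy_decode and dictionary_decode are all made up for illustration; a real recognizer would use an HMM search rather than this fixed-length comparison.

```python
from math import prod


def greedy_decode(frames):
    """Pick the single most probable phoneme at each position."""
    return [max(frame, key=frame.get) for frame in frames]


def dictionary_decode(frames, lexicon, floor=1e-3):
    """Return the lexicon word whose phoneme sequence scores highest.

    Phonemes the classifier did not list get a small floor probability
    (an arbitrary value chosen for this sketch).
    """
    def score(phonemes):
        return prod(frame.get(p, floor) for frame, p in zip(frames, phonemes))

    # Only words with the same number of phonemes are compared here.
    candidates = {w: ps for w, ps in lexicon.items() if len(ps) == len(frames)}
    if not candidates:
        return None
    return max(candidates, key=lambda w: score(candidates[w]))


# Hypothetical classifier output for the spoken word "stone":
# the weak first consonant is slightly more likely to be heard as "f".
frames = [
    {"s": 0.45, "f": 0.55},
    {"t": 0.60, "p": 0.40},
    {"ow": 0.90, "aa": 0.10},
    {"n": 0.80, "m": 0.20},
]
lexicon = {
    "stone": ("s", "t", "ow", "n"),
    "scone": ("s", "k", "ow", "n"),
    "bone": ("b", "ow", "n"),
}

print(greedy_decode(frames))               # top phoneme per position
print(dictionary_decode(frames, lexicon))  # best-scoring dictionary word
```

With these made-up numbers the per-position winner starts with "f", which is not a word in the lexicon, while the dictionary lookup falls back on the second-ranked "s" and recovers "stone". That is the role the quoted message assigns to the dictionary.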
> > > > >
> > > > > 73 and GL, Vojtech OK1IAK
> > >
> > > --
> > > Regards, Robert Thompson
> > >
> > > ====================================================
> > > ~ Concise, Complete, Correct: Pick Two
> > > ~ Faster, Cheaper, Better: Pick Two
> > > ~ Pervasive, Powerful, Trustworthy: Pick One
> > > ~ "Whom the computers would destroy, they first drive mad."
> > > ~ -- Anonymous
> > > ====================================================
>
> --
> Regards, Robert Thompson

--
Regards, Robert Thompson
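
For a sense of scale, the rates mentioned in the quoted thread can be put side by side. This is a rough Python sketch, not a measurement. It takes the "one bit every 10 ms" description and the codec rates as quoted (5.6 kbit/s for GSM half-rate, 3.6 kbps for AMBE). The function name bits_per_second and the figure of about 10 phonemes per second are assumptions added purely for illustration.

```python
def bits_per_second(bit_period_ms):
    """Signalling rate for one bit sent every bit_period_ms milliseconds."""
    return 1000.0 / bit_period_ms


# "Every 10 ms a decision is made to send a one or a zero."
proposed_rate = bits_per_second(10)

# Codec rates as quoted in the thread, in bit/s.
codec_rates = {"GSM half-rate": 5600, "AMBE": 3600}

for name, rate in codec_rates.items():
    ratio = rate / proposed_rate
    print(f"{name}: {rate} bit/s, {ratio:.0f}x the proposed rate")

# ASSUMPTION: roughly 10 phonemes per second of speech (illustrative only).
assumed_phonemes_per_second = 10
bits_per_phoneme = proposed_rate / assumed_phonemes_per_second
print(f"Bit budget per phoneme: {bits_per_phoneme:.0f} bits")
```

Under that assumed phoneme rate, each phoneme gets a budget of only a handful of bits, which is why the scheme depends on the phoneme detector being right the first time.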