Hi Mike.

I studied some aspects of voice recognition about 10 years ago when I thought 
of joining a research group at Czech Technical University in Prague. I have a 
260-page textbook on voice recognition on my bookshelf.

A voice signal has high redundancy compared to a text transcription, but it 
also carries additional information such as pitch, intonation and speed. One 
could, for example, estimate the mood of the speaker from an utterance.

The vocal tract can be described by a generator (a tone for vowels, a hiss for 
consonants) and a filter. Translating voice into generator and filter 
coefficients greatly decreases the redundancy of the voice data. This is 
roughly what the common voice codecs do. GSM voice compression is a kind of 
Algebraic Code Excited Linear Prediction (ACELP). Another interesting codec is 
AMBE (Advanced Multi-Band Excitation), used by the DSTAR system. The GSM 
half-rate codec squeezes voice down to 5.6 kbit/s, AMBE to 3.6 kbit/s. Both 
systems use excitation tables, but AMBE is more efficient and closed source. I 
think the key to the efficiency is the size and quality of the excitation 
tables; creating such an algorithm requires a considerable amount of research 
and data analysis. The intelligibility of the GSM and AMBE codecs is very good. 
You can buy the intellectual property of the AMBE codec by buying the chip, and 
there are a couple of projects trying to build DSTAR into legacy transceivers.
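
To illustrate the source-filter principle (this is not the actual GSM or AMBE 
algorithm, just the basic idea, and all signal parameters are invented for the 
example), here is a minimal Python sketch: it estimates the all-pole "vocal 
tract" filter of one speech frame with the autocorrelation method and 
Levinson-Durbin recursion, then re-synthesizes the frame from a crude 
impulse-train excitation.

# Minimal source-filter (LPC) sketch.  Real codecs (ACELP, AMBE) add
# codebooks, long-term pitch prediction and quantization on top of this.
import numpy as np

def lpc(frame, order=10):
    """All-pole coefficients via autocorrelation + Levinson-Durbin."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        new_a = a.copy()
        new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        new_a[i] = k
        a, err = new_a, err * (1.0 - k * k)
    return a, err                        # filter coefficients, residual energy

def synthesize(a, gain, n, pitch_period):
    """Drive the all-pole filter with an impulse train ('voiced' excitation)."""
    excitation = np.zeros(n)
    excitation[::pitch_period] = 1.0
    y = np.zeros(n)
    for t in range(n):
        acc = gain * excitation[t]
        for k in range(1, len(a)):
            if t - k >= 0:
                acc -= a[k] * y[t - k]
        y[t] = acc
    return y

fs = 8000
t = np.arange(240) / fs                              # one 30 ms frame
rng = np.random.default_rng(0)
frame = sum(np.sin(2 * np.pi * 125 * h * t) / h for h in (1, 2, 3, 4))
frame += 0.05 * rng.standard_normal(len(t))          # a little breath noise
frame *= np.hamming(len(frame))                      # fake 'vowel', 125 Hz pitch
a, err = lpc(frame, order=10)
voice = synthesize(a, np.sqrt(max(err, 1e-12)), len(frame), fs // 125)
print('LPC coefficients:', np.round(a[1:], 3))

Only the handful of filter coefficients, the gain and the pitch need to be 
transmitted per frame, which is where the compression comes from.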

About 10 years ago, we at the OK1KPI club experimented with an EchoLink-like 
system. We modified the Speak Freely software to control an FM transceiver and 
added a web interface for controlling the tuning and subtone of the 
transceiver. It was a lot of fun and a rather unique system at that time. 
http://www.speakfreely.org/ The best compression factor is offered by the 
LPC-10 codec (3460 bit/s), but the sound is very robot-like and quite hard to 
understand, so in the end we reverted to GSM. I think IVOX is a variant of the 
LPC system that we tried.

Your proposal is to increase the compression rate by transmitting phonemes. I 
once had the same idea, but I quickly rejected it. Although it may be a nice 
exercise, I find it not very useful until good continuous-speech, multi-speaker, 
multi-language recognition systems are available. I will try to explain my 
reasoning behind that statement.

Let's classify voice recognition systems by implementation complexity:
1) single-speaker, limited set of recognized utterances (controlling your 
desktop by voice)
2) multi-speaker, limited set of recognized utterances (automated phone system)
3) dictation system
4) continuous speech transcription
5) speech recognition and understanding

Your proposal would need to implement most of the code from 4) or 5) to be 
really usable, and it would have to be reliable.

State-of-the-art voice recognition systems use hidden Markov models to detect 
phonemes. A phoneme is found by traversing a state diagram while evaluating a 
sequence of recorded spectra. The phoneme is soft-decoded: the output of the 
classifier is a list of phonemes with their detection probabilities attached. 
To cope with phonemes smearing into each other at their boundaries, either 
sub-phonemes or phoneme pairs need to be detected.
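
As a toy illustration of that soft decoding (the two-dimensional "spectra" and 
the two phoneme models below are completely invented, nothing like a real 
trained acoustic model), this Python sketch scores a short observation 
sequence against two small left-to-right HMMs with the scaled forward 
algorithm and prints a ranked list of phoneme hypotheses with probabilities:

# Toy soft-decision phoneme classifier: score the same frames against
# several small left-to-right HMMs and emit ranked hypotheses.
# Features and models are invented; real systems use MFCCs and trained
# tri-phone (phoneme pair / sub-phoneme) models.
import numpy as np

def gauss(x, mean, var):
    d = x - mean
    return np.exp(-0.5 * np.sum(d * d / var)) / np.sqrt(np.prod(2 * np.pi * var))

def forward_loglik(obs, trans, means, variances):
    """log P(obs | model), scaled forward algorithm, start in state 0."""
    alpha = np.zeros(trans.shape[0])
    alpha[0] = gauss(obs[0], means[0], variances[0])
    loglik = np.log(alpha.sum() + 1e-300)
    alpha /= alpha.sum() + 1e-300
    for x in obs[1:]:
        emit = np.array([gauss(x, m, v) for m, v in zip(means, variances)])
        alpha = (alpha @ trans) * emit
        s = alpha.sum() + 1e-300
        loglik += np.log(s)
        alpha /= s
    return loglik

left_right = np.array([[.6, .4, 0.], [0., .6, .4], [0., 0., 1.]])
models = {                                   # two hypothetical phonemes
    'a': dict(trans=left_right, means=np.array([[1., 1.], [2., 2.], [1., 1.]]),
              variances=np.full((3, 2), .5)),
    's': dict(trans=left_right, means=np.array([[-1., 0.], [-2., 1.], [-1., 0.]]),
              variances=np.full((3, 2), .5)),
}

obs = np.array([[1.1, 0.9], [1.9, 2.1], [1.2, 1.0]])   # a few fake spectra
logs = {ph: forward_loglik(obs, **m) for ph, m in models.items()}
best = max(logs.values())
probs = {ph: np.exp(l - best) for ph, l in logs.items()}
total = sum(probs.values())
for ph in sorted(probs, key=probs.get, reverse=True):
    print(f'{ph}: {probs[ph] / total:.3f}')            # ranked phoneme hypotheses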

After the phonemes are classified, they are chained into words, and the most 
probable words are picked according to a dictionary. You suppose that your 
system will not need this step, but the trouble is the consonants: they carry 
much less energy than vowels and are much easier to confuse. The dictionary is 
what allows the recognizer to pick a consonant with only the second-highest 
detection probability when that makes a better word. Not only the dictionary, 
but also the phoneme classifier is language dependent.
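
Here is a tiny, completely hypothetical example of that dictionary step. The 
per-slot phoneme probabilities and the three "words" are invented, and a real 
recognizer works on variable-length phoneme lattices rather than fixed slots, 
but it shows how a dictionary promotes a second-best consonant:

# Toy example of how a dictionary rescues a weak consonant detection.
slots = [
    {'b': 0.55, 'p': 0.45},          # consonant: low energy, easy to confuse
    {'o': 0.95, 'a': 0.05},
    {'n': 0.60, 'm': 0.40},
]
dictionary = ['pon', 'bom', 'ban']   # hypothetical words in phoneme spelling

def word_score(word):
    p = 1.0
    for slot, ph in zip(slots, word):
        p *= slot.get(ph, 1e-6)      # unseen phoneme -> tiny probability
    return p

hard = ''.join(max(s, key=s.get) for s in slots)
best = max(dictionary, key=word_score)
print('hard phoneme decision:', hard)   # 'bon' -- raw top phonemes, not a word
print('dictionary pick      :', best)   # 'pon' -- second-best consonant wins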

I think the human brain works in the same way. Imagine learning a foreign 
language: even if you are able to recognize slowly pronounced words, you will 
be unable to pick them out of a quickly spoken sentence, because the words 
sound different. A human needs considerable training to understand a language. 
You could decrease the complexity of the decoder by constraining the detection 
to slowly dictated, separate words.

If you simply pick the highest-probability phoneme, you will experience the 
comprehension problems of people with hearing loss. Oh yes, I currently work 
for a hearing instrument manufacturer (I have nothing to do with merck.com).

from http://www.merck.com/mmhe/sec19/ch218/ch218a.html
> Loss of the ability to hear high-pitched sounds often makes it more difficult 
> to understand speech. Although the loudness of speech appears normal to the 
> person, certain consonant sounds—such as the sound of letters C, D, K, P, S, 
> and T—become hard to distinguish, so that many people with hearing loss think 
> the speaker is mumbling. Words can be misinterpreted. For example, a person 
> may hear “bone” when the speaker said “stone.”

For me, it would be very irritating to dictate slowly to a system, knowing that 
it will add some mumbling, without even getting feedback about the errors the 
recognizer makes. From my perspective, until good voice recognition systems are 
available, it is reasonable to stick to the keyboard for extremely low bit 
rates. If you would like to experiment, there are plenty of open-source voice 
recognition packages. I am sure you could hack one of them to output the most 
probable phoneme detected, and then try for yourself whether the result is 
intelligible or not. You do not need the sound-generating part for that 
experiment; it is quite easy to read the written phonemes. Once you have a good 
phoneme detector, the rest of your proposed software package is a piece of 
cake.

I am afraid I will disappoint you, but I do not mean to disparage your work; I 
found a couple of nice ideas in your text. I like the idea of setting up the 
varicode table so that similarly sounding phonemes get neighboring codes, and 
of coding phoneme length by filling gaps in the data stream with a special 
code. But I would suggest reading a textbook on voice recognition so as not to 
reinvent the wheel.
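
Just to make sure I understood your varicode idea correctly, here is how I 
picture it in a short Python sketch; the phoneme ordering and the REPEAT 
symbol are my own invention, not something taken from your text:

# Hypothetical sketch of the varicode idea: similar-sounding phonemes get
# neighbouring code values (so a one-step decoding error stays close in
# sound), and a special REPEAT symbol stretches the previous phoneme
# instead of retransmitting it.
phoneme_codes = {          # invented ordering -- group confusable sounds
    'p': 0, 'b': 1, 't': 2, 'd': 3, 'k': 4, 'g': 5,     # plosives
    's': 6, 'z': 7, 'f': 8, 'v': 9,                     # fricatives
    'a': 10, 'o': 11, 'u': 12, 'e': 13, 'i': 14,        # vowels
}
REPEAT = 15                # "previous phoneme continues one more frame"

def encode(phonemes):
    out, prev = [], None
    for ph in phonemes:
        out.append(REPEAT if ph == prev else phoneme_codes[ph])
        prev = ph
    return out

print(encode(['s', 'o', 'o', 'o', 'p', 'a']))   # -> [6, 11, 15, 15, 0, 10]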

73 and GL, Vojtech OK1IAK
