Will - thanks. I think I have grasped the basics now.

Ian

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: 24 February 2007 20:10
To: Ian Pascoe
Cc: Ubuntu-accessibility@lists.ubuntu.com
Subject: Re: FW: Voice Recognition for Linux

Here's a paper that describes how Sphinx-4 works:
http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4Whitepaper.pdf

Hope this helps,
Will

Ian Pascoe wrote:
> Hi Henrik / Eric
>
> Although I don't want to get embroiled in your discussions, I have a
> question to ask on this.
>
> How does voice recognition work - does it use word parts, as a TTS
> engine like eSpeak does but in reverse, or does it maintain a
> dictionary of actual words?
>
> I presume that the problems you are corresponding about are not in the
> way the STT engine works but in the way it interprets the input?
>
> Fascinating discussions - thanks
>
> Ian
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Eric S. Johansson
> Sent: 23 February 2007 19:45
> To: Henrik Nilsen Omma
> Cc: Ubuntu Accessibility Mailing List
> Subject: Re: Voice Recognition for Linux
>
> Henrik Nilsen Omma wrote:
>> Eric S. Johansson wrote:
>
>> Looks like the original text got caught in a spam filter somewhere
>> because of the attachment (I found it in the web archives). No worries
>> about the tone. We are having a frank technical discussion and need to
>> speak directly to get our points across. So my turn :) ...
>
> Thanks for the understanding, but it always helps to be polite.
>
>> I think you are too caught up in the current working model of NS to see
>> how things can be done differently.
>
> You haven't seen the comments I've made in the past about speech user
> interfaces and what Dragon has done wrong. I have proposed many things
> that should be fixed, but the current command model is not one of them.
>
>> I have not studied the details of voice recognition and voice models,
>> ... but I do appreciate the need for custom voice model training over
>> time. There is a need for feedback, but it does _not_ need to be
>> real-time. Personally, I would prefer it not to be real time. NS does
>> in theory tout this as a feature when they claim that you can record
>> speech on a voice recorder and dump it into NS for transcription. I
>> have no idea whether that actually works.
>
> Okay, I should probably attempt to capture some of the user experience
> issues.
>
> Correcting misrecognitions is something people debate a lot. If you
> don't correct misrecognitions, you'll most likely get the same thing
> over and over again. The output of the language and recognition model
> is probabilistic, so misrecognitions will change from time to time, but
> it'll basically be the same kind of misrecognition. (Yes, all
> uncorrected.)
>
> The user is then faced with a choice: do you correct the recognition
> engine or do you edit the document? In both cases, it's painful. But
> then you get the odd case where the misrecognition is completely
> unintelligible and you don't have any idea what the hell you said. Then
> you have no choice but to go back, listen to what was said at that
> phrase, and make a correction. This is a very real user experience. I
> have spoken with people who write documents in Microsoft Word and
> they'll go back to page 5 out of 20, see something that's garbled, and
> play it back so they can figure out what they said. They usually don't
> correct heavy garbling but just say it again and get a more consistent
> recognition from that point forward, courtesy of the incremental
> training.
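>
> Going back to Ian's question for a second: engines like Sphinx-4 and
> NaturallySpeaking don't keep whole-word audio templates. They match the
> audio against phoneme sequences from a pronunciation lexicon and then
> combine that acoustic score with a language-model score for the word
> sequence. Here's a toy Python sketch of that combination; every word,
> phoneme string, and number in it is invented for illustration, and it
> is nobody's real engine:
>
>     import math
>
>     # Pronunciation lexicon: the engine matches sub-word units
>     # (phonemes) against the audio, not whole words.
>     LEXICON = {
>         "recognize speech": "R EH K AH G N AY Z S P IY CH",
>         "wreck a nice beach": "R EH K AH N AY S B IY CH",
>     }
>
>     # Invented acoustic log-scores: how well the audio matches each
>     # phoneme sequence.  The two are deliberately close.
>     ACOUSTIC = {
>         "recognize speech": math.log(0.30),
>         "wreck a nice beach": math.log(0.32),
>     }
>
>     # Invented language-model log-scores: how plausible each word
>     # sequence is on its own.
>     LANGUAGE = {
>         "recognize speech": math.log(0.020),
>         "wreck a nice beach": math.log(0.001),
>     }
>
>     def decode():
>         """Return the hypothesis with the best combined score."""
>         return max(LEXICON, key=lambda h: ACOUSTIC[h] + LANGUAGE[h])
>
>     print(decode())  # "recognize speech": the language model
>                      # outvotes the slightly better acoustic match
>
> The same numbers also show why an uncorrected misrecognition keeps
> coming back: the scores are stable, so the same audio keeps losing to
> the same wrong hypothesis until a correction shifts the model.
>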
> In theory, you can dictate into most applications using something
> called natural text. It's direct text injection with a history of what
> was said (audio and recognition). You can do limited correction by
> Select-and-Say, and it even sort of, kind of works if it's a full
> native Microsoft Windows application. Tools like Thunderbird, gaim, and
> Emacs don't work so well. How they feel is for a later discussion.
>
> But you have this nice tool, that's almost right, called the dictation
> box. It's a little window which has full editing and correction
> capability using the voice model of NaturallySpeaking. When you are
> done with your dictation, you can inject it into the application it's
> associated with. The wonderful thing about the dictation box is that
> making corrections significantly improves accuracy. If I dictated into
> nothing but the dictation box for a week, I would have a significantly
> more accurate system and a lower level of frustration over
> misrecognitions. If I had whatever magic the dictation box uses in all
> of my applications, I would be ecstatic. I wouldn't need to retrain
> every six months. But it's not sufficient. Why it isn't is again a
> conversation for a future time.
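>
> In outline, the dictation box is nothing more exotic than the flow
> below. This is a hypothetical Python sketch, not Nuance's code; the
> engine and injection calls are stand-ins:
>
>     # Hypothetical dictation-box flow.  FakeEngine stands in for a
>     # recogniser that accepts correction feedback; no real
>     # NaturallySpeaking API is being described here.
>
>     class FakeEngine:
>         def __init__(self):
>             self.corrections = []     # (heard, fixed) pairs fed back
>
>         def correct(self, heard, fixed):
>             # A real engine would adapt its models here; we just
>             # record the pair to show where the feedback goes.
>             self.corrections.append((heard, fixed))
>
>     def inject(text):
>         # Stand-in for typing text into the target application
>         # (on X11 this could shell out to `xdotool type`).
>         print("injected:", text)
>
>     def dictation_box(engine, utterances, user_fixes):
>         """Buffer dictation, take corrections, then inject the text."""
>         final = []
>         for heard in utterances:
>             fixed = user_fixes.get(heard, heard)  # user edit if needed
>             if fixed != heard:
>                 engine.correct(heard, fixed)      # incremental training
>             final.append(fixed)
>         inject(" ".join(final))
>
>     engine = FakeEngine()
>     dictation_box(engine,
>                   ["wreck a nice beach"],
>                   {"wreck a nice beach": "recognize speech"})
>     print("fed back:", engine.corrections)
>
> The point of the shape is that corrections happen against the engine's
> own history before any text reaches the application, which is exactly
> why heavy dictation-box use improves the voice model.
>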
> If you want to migrate away from incremental recognition, you'll need
> to look to NaturallySpeaking 3 or NaturallySpeaking 4 for the user
> experience. You would probably lose 1 to 2% (or more) in accuracy,
> which is really significant. Believe me, there's a huge difference
> between 99% and 99.5% recognition accuracy in actual operating
> conditions. It's also important to note that Dragon changed the
> incremental correction model a couple of times. The last time I was in
> touch with Dragon employees (before the Bakers got greedy), they were
> really convinced that incremental training, properly done, gave a
> significantly better user experience, and I would have to say, from
> what I hear and from what I have experienced, I think they were right.
> Maybe they were drinking their own Kool-Aid, maybe they were onto
> something. I am no stranger to figuring out interesting ways to get the
> signals you need to do something right, so I trust them.
>
> But independent of your desire, you may not be able to turn it off. You
> may have users who know how it works making your life uncomfortable
> because you have made their life less pleasant. You will have me
> demanding the highest possible accuracy. :-)
>
> I think at this point it would be a really good idea for you to go
> purchase a copy of NaturallySpeaking 9 Preferred. Get a really good
> headset. The one that comes in the box is a piece of crap. No,
> seriously, it's really bad. I can give you some recommendations on
> headsets (VXI mostly), but I really, really love my VXI Bluetooth
> wireless headset. It has some flaws, but it is just so sweet.
>
>> I don't really want to interact with the voice engine all the time, I
>> want it to mostly stay out of my way. I don't want to look at the
>> little voice level bar when I'm speaking or read the early guesses of
>> the voice engine. I want to look out the window or look at the
>> spreadsheet that I'm writing an email about :) The fact that NS updates
>> the voice model incrementally is actually a bad feature. I don't want
>> that. If I have a cold one day, or there is noise outside, or the mic
>> is a bit displaced, the profile gets damaged. That's probably why you
>> have to start a fresh one every six months.
>
> Can you use your keyboard without the delete or backspace key? Or even
> the arrow keys? The correction dialog I'm talking about is as core to
> your daily operation as those keys are. As for changing focus, sure,
> you can do it, but only if you have an application which is sufficiently
> speech aware to record your audio track at the same time and be able to
> play back a segment you think is an error. It's the only way you'll
> make corrections unless you have a memory which is a few orders of
> magnitude better than mine.
>
> I should also note that if you don't have a clear and accurate
> indication of what's a misrecognition error, correcting something that
> is right can make your user model go bad quickly. At least so I am
> told. Of course, I've never done anything like that, no, no way. Uh-huh.
>
>> Instead of saving my voice profile every day, I would like to save up a
>> log of all the mistakes that were made during the week. I would then
>> sit down for a session of training to help NS cope with those words and
>> phrases better. I would first take a backup of my voice profile, then
>> say a few sample sentences to make sure everything was generally
>> working OK. I would then read passages from the log and do the needed
>> correction and re-training. I would save the profile and start using
>> the new one for the next week. I would also save profiles going back
>> four weeks, and once a month I would do a brief test with the stored-up
>> profiles to see if the current one had degraded over time. If it had, I
>> would roll back to an older one and perhaps do some training from
>> recent logs too. There is no reason a voice profile should just
>> automatically go bad over time.
>
> Now you're thinking like a geek. Ordinary users eventually learn when
> to save a profile based on the type and number of corrections they
> make. They don't test profiles, they just save them and count on the
> system to automatically back up every few saves. I don't save mine
> every day, and I only save my profile when I correct really persistent
> misrecognitions. If I'm getting a cold or hay fever, I definitely don't
> save, but I also suffer from reduced recognition for a few days.
>
> User reluctance to put in the effort is the reason why you train on a
> document once at the beginning. I usually choose a couple of different
> documents to train on after a month on a new model, but I am a rarity.
> I described this behavior in a white paper I wrote called "Spam Filters
> Are Like Dogs". You have expert trainers and you have people whose dogs
> crap on the neighbors' lawns. Same category of animals, with roughly
> the same skill potential, but very different training models.
> NaturallySpeaking is trying to take advantage of the "less formal"
> behaviors for training, and they're doing a pretty good job of
> succeeding with those signals.
>
> Don't force the ordinary user to train at an expert level. It won't
> work, it will just piss them off, and it will discourage if not drive
> away the moderately expert user who wants to work in the way they are
> comfortable.
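>
> Incidentally, the save/test/roll-back regime you describe needs nothing
> from the engine itself, since a profile is just files on disk. A
> minimal sketch, with an invented profile location (point it at wherever
> your engine actually keeps its voice files):
>
>     # Hypothetical profile-versioning helper; the paths are
>     # assumptions, not any engine's real layout.
>     import shutil
>     from datetime import date
>     from pathlib import Path
>
>     PROFILE = Path.home() / "voice-profile"          # invented path
>     BACKUPS = Path.home() / "voice-profile-backups"
>
>     def backup():
>         """Snapshot today's profile so a bad week can be undone."""
>         dest = BACKUPS / date.today().isoformat()
>         shutil.copytree(PROFILE, dest)
>         return dest
>
>     def snapshots():
>         """List saved snapshots, oldest first."""
>         return sorted(p.name for p in BACKUPS.iterdir() if p.is_dir())
>
>     def rollback(name):
>         """Replace the live profile with an older snapshot."""
>         shutil.rmtree(PROFILE)
>         shutil.copytree(BACKUPS / name, PROFILE)
>
> Weekly use would be: backup(), dictate and train for the week, and if
> accuracy has degraded, rollback(snapshots()[-2]) to an earlier state.
>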
>> The fact that you have to constantly interact with the voice engine is
>> not a feature, it's a bug! It's just that you have adapted your
>> dictation to work around it. It's not at all clear that interactive
>> correction is better than batched correction. It certainly should not
>> be seen as a blocker for a project like this going forward. I wouldn't
>> want to spend years on a project simply to replicate NS on Linux.
>> There is plenty of room for improvement in the current system.
>
> You constantly interact with your computer and expect from it a bunch
> of feedback. This is no different. You're not looking at speech levels,
> but you may be looking at load averages, the time of day, alerts about
> e-mail coming in, the cursor position in an editor buffer, or color
> changes from syntax highlighting. These are all forms of feedback.
> Incremental training and looking at recognition sequences are just
> different forms of feedback. He learned to incorporate it in your
> operation.
>
> ("He learned" is a persistent misrecognition error that mostly shows up
> when using natural text. Because I'm not in a place where I can correct
> it often enough, it keeps showing up. If I were in the dictation box
> right now, it would be mostly gone. This is why incremental recognition
> correction is so very, very important. Batch training has never made
> this go away, and I've tried. The only thing that has succeeded has
> been incremental correction, in one context.)
>
>> OK, now for some replies:
>
> You mean the above weren't enough? :-)
>
>>> There is a system that already exists that does exactly what you've
>>> opposed.
>>
>> [assuming you meant 'proposed' here] Unlikely. If a system with that
>> level of usability existed it would already be in widespread use.
>>
>>> While it was technically successful, it has failed in that nobody
>>> but the originator uses it, and even he admits this model has some
>>> serious shortcomings.
>>
>> What system, where? What was the model and what were the shortcomings?
>
> http://eepatents.com/ but the package is no longer visible. Ed took it
> down a while ago. His package used xinput direct injection. He used a
> Windows application with a window to receive the dictation information
> and inject it into the virtual machine. He was able to do straight
> injection of text, limited by what NaturallySpeaking put out. I think
> he did some character sequence translations, but I'm not sure. He
> couldn't control the mouse, couldn't switch windows, and had only
> global commands, not application-specific commands. I could be wrong on
> some of these points, but that's basically what I remember.
>
> There was also a bunch of other stuff, like being complicated to set
> up, but that can be fixed relatively easily. Especially if you remove
> the dependency on Twisted.
>
> To my mind, it's the same as what you're proposing. And there is
> general agreement that it is only a starting point for the very
> committed/dedicated.
>
>>> The reason I insist on feedback is very simple. A good speech
>>> recognition environment lets you correct recognition errors and
>>> create application-specific and application-neutral commands.
>>
>> Yes, we agree that you need correction. The application-specific
>> features can be implemented in this model too, in the same way that
>> Orca uses scripting.
>
> I don't know how Orca uses scripting. Pointers?
>
> Seriously though, I want a grammar and the ability to associate methods
> with the grammar. I do know I'm not the only one, because there is a
> fair number of people who have built grammars using the
> NaturallySpeaking Visual Basic environment, natpython, and a couple of
> macro packages built on top of natpython.
>
> Even if you convince me, you'll have to convince them.
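>
> To be concrete about "a grammar with methods", I mean something of this
> shape. This is hypothetical Python, loosely in the spirit of the
> natpython-style macro packages, not any package's actual API; the rule
> names, phrases, and helpers are all invented:
>
>     def send_keys(chord):
>         # Stand-in for synthetic keystrokes (e.g. `xdotool key` on X11).
>         print("keys:", chord)
>
>     class EmacsGrammar:
>         """Commands that should fire only when Emacs has focus."""
>         app = "emacs"                       # active-window filter
>         rules = {
>             "save buffer": "save",
>             "kill line": "kill_line",
>         }
>
>         def save(self):
>             send_keys("ctrl+x ctrl+s")
>
>         def kill_line(self):
>             send_keys("ctrl+k")
>
>     def dispatch(grammar, active_app, phrase):
>         """Run the method bound to a recognized phrase, or fall
>         through to plain dictation when nothing matches."""
>         if active_app == grammar.app and phrase in grammar.rules:
>             getattr(grammar, grammar.rules[phrase])()
>         else:
>             print("dictate:", phrase)
>
>     dispatch(EmacsGrammar(), "emacs", "kill line")  # keys: ctrl+k
>     dispatch(EmacsGrammar(), "gaim", "kill line")   # dictate: kill line
>
> The active-window filter is the same idea as the application-specific
> mode switching I describe further down: the Emacs commands simply do
> not exist when Emacs isn't in front.
>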
>> You would still have to correct the mistake at some point. I would
>> prefer to just dictate on and come back and correct all the mistakes
>> at the end. One should read through before sending in any case ;)
>
> Oh, I understand, but in my experience, if I don't pay attention to
> what the recognition system is saying, my speech gets sloppy and my
> recognition accuracy drops significantly until I have something which
> is completely unrecognizable at the end. Also, I'm probably "special"
> in this case, but even when I was typing, I continually looked back at
> the document as far as the screen permits, searching for errors. It
> seems to help me keep speaking written speech and identify where I'm
> using spoken speech for writing. I know other people like you want to
> just dictate and not look back. Some of them will turn their chair
> around and stare at a painting on the wall while they dictate. But
> there are those, like me, that can't.
>
>> And I think that is a serious design-flaw for two (related) reasons:
>> It gradually corrupts your voice files AND it makes the user
>> constantly worry about whether that is happening. You have to make
>> sure to speak as correctly as possible at all times and always make
>> sure to stop immediately and correct all the mistakes. Otherwise your
>> profile will be hosed. I repeat: that is a bug, not a feature. You end
>> up adapting more to the machine than the machine adapts to you. *That
>> is a bug.*
>
> It's a feature... Seriously, get NaturallySpeaking and play with the
> dictation box as well as natural-text-driven applications. When you
> have something that is Select-and-Say enabled, you don't need to pay
> attention all the time; you can go back a paragraph or two or three and
> fix your errors. The only time you need to pay attention is when you
> are using natural text, which is one way Nuance forces you to toe the
> line when it comes to applications. That is a bug!
>
>> I think this is an NS bug too. I don't want natural editing, I only
>> want natural dictation. I want two completely separate modes: pure
>> dictation and pure editing. If I say 'cut that' I want the words 'cut
>> that' to be typed. To edit I want to say: 'Hal: cut that bit'. Why?
>> Because that would improve overall recognition and would remove the
>> worry that you might delete a paragraph by mistake. NS would only
>> trigger its special functions on a single word, and otherwise just do
>> its best to transcribe. You would of course select that word to be one
>> that it would never get wrong. (You could argue that natural editing
>> is a feature, but the fact that you cannot easily configure it to use
>> the modes I described is a design-flaw.)
>
> A few things are very important in this paragraph. Prefacing a command
> is something I will really fight against. It is a horrible thing to
> impose on the user because it adds extra vocal load and cognitive load.
> Voice coder has a "yo" command model for certain commands, and I just
> refuse to use them. I type rather than say them; that sequence is so
> repellent to me. I have also had significant experience with modal
> commands in DragonDictate, which is why I have such a strong reaction
> against the command preface, and it is why Dragon Systems went away
> from them. Remember, Dragon was a technology-dedicated company, and I
> know for a fact that some of the employees were quite smart. If
> Dragon's research group does something and sticks with it, there's
> probably a good reason for it.
>
> I think part of our differences comes from modal versus nonmodal user
> interfaces. I like Emacs, which is (mostly) nonmodal; other people like
> vi, which is exceptionally modal. Non-modal user interfaces are
> preferable in these circumstances if the indicator to activate some
> command or different course of action is relatively natural. For
> example, if I say "don't show dictation box", I just get text. But if I
> say "show dictation box" with a pause before the text as well as after,
> up comes the dictation box. Same words, but the simple addition of
> natural-length pauses allows NaturallySpeaking to identify the command
> and activate it only when it's asked for. Yes, it's training, but
> minimal training, and it applies everywhere when separating commands
> from text. This works for NaturallySpeaking commands and my private
> commands.
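>
> Mechanically, that pause trick amounts to something like the sketch
> below. It is hypothetical; a real engine does this inside the decoder
> with its own timing information, and the threshold here is invented:
>
>     # Hypothetical command spotting: a phrase counts as a command
>     # only when it matches a command grammar AND is isolated by
>     # natural-length pauses on both sides.
>
>     COMMANDS = {"show dictation box", "cut that"}
>     PAUSE = 0.35   # seconds; an assumed "natural pause" threshold
>
>     def classify(phrase, gap_before, gap_after):
>         """Label one recognized phrase as command or dictation."""
>         isolated = gap_before >= PAUSE and gap_after >= PAUSE
>         if phrase in COMMANDS and isolated:
>             return "command"
>         return "dictation"
>
>     # Spoken mid-sentence, the words are just text:
>     print(classify("show dictation box", 0.10, 0.12))  # dictation
>     # Set off by pauses on both sides, they trigger the command:
>     print(classify("show dictation box", 0.60, 0.50))  # command
>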
> There is one additional form of mode switching in NaturallySpeaking,
> and that's the switching of commands based on which program is active
> and its state (i.e. running dialog boxes or something equivalent).
> That's why I have Emacs commands that are only active when running
> Emacs.
>
>> Precisely. It's because they don't want to fiddle with the program,
>> they just want to dictate.
>
> But those that just dictate get unacceptable results. Try it. When you
> get NaturallySpeaking running, just dictate and never ever correct, and
> see what happens. Then try it the other way around, using the dictation
> box whenever possible.
>
> ---eric
>
> --
> Speech-recognition in use. It makes mistakes, I correct some.


--
Ubuntu-accessibility mailing list
Ubuntu-accessibility@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-accessibility