Re: [whatwg] Speech input element
On Tue, 15 Jun 2010 17:08:40 +0200, Satish Sampath sat...@google.com wrote: To add a little more clarity - we initially proposed a speech input API using a new input type=speech element. The top feedback we received was to extend speech as a form of input to existing elements instead of creating a new speech control. We have taken that into account in the new proposal which extends speech input to existing form elements and other editable elements. Please take a fresh look and share your thoughts. Could you maybe post a link to the proposal? Or in case you intended to attach it: it didn't get through. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] Speech input element
Please see http://docs.google.com/View?id=dcfg79pz_5dhnp23f5 for the new proposal (Bjorn's earlier post had this link). Cheers Satish
Re: [whatwg] Speech input element
To add a little more clarity - we initially proposed a speech input API using a new input type=speech element. The top feedback we received was to extend speech as a form of input to existing elements instead of creating a new speech control. We have taken that into account in the new proposal which extends speech input to existing form elements and other editable elements. Please take a fresh look and share your thoughts. -- Cheers Satish
Re: [whatwg] Speech input element
From TFA: We would like some way of having speech control in a web application, without any input fields. For example, in a webmail client, there are buttons, links etc. that let the user take actions such as deleting or replying to email. We would like to make it easy to implement a speech interface where the user can say "read next message", "archive", "reply" etc., without having to show a text field where the same commands can be typed.
<link rel="next" title="Read next message">
<form action="archive" method="POST" title="Archive message"> <!-- @method=MOVE? -->
  <button type="submit"></button>
</form>
-- kv, - Bjartur
Re: [whatwg] Speech input element
---From TFA--- A web search application can accept speech input, and perform a search immediately when the input is recognized. If it has access to the additional recognition hypotheses (aka the N-best list), it can display them on the search results page and let the user choose the correct query if the input was misrecognized. For example, Google search might display search results for "recognize speech", and show a link with the text "Did you say 'wreck a nice beach'?". --- --- User-agents can submit any GET forms immediately. They may also keep the form open for editing and list correction suggestions alongside it as the form gets submitted and results are shown to the user. That shouldn't get mixed up with the results. -- kv, - Bjartur
Re: [whatwg] Speech input element
Based on the feedback in this thread we've worked out a new speech input proposal that adds a @speech attribute to most input elements, instead of a new input type=speech. Please see http://docs.google.com/View?id=dcfg79pz_5dhnp23f5 for the new proposal. /Bjorn Bringert Satish Sampath -- Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
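For readers who don't open the linked document, a minimal sketch of what such markup might look like; the bare speech attribute shown here is an assumption for illustration, not necessarily the proposal's exact syntax:
--- code sample (illustrative only) ---
<!-- assumed syntax: a speech attribute on existing form controls and editable elements -->
<input type="text" name="q" speech>
<textarea name="body" speech></textarea>
--- end of code sample ---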
Re: [whatwg] Speech input element
On Thu, May 20, 2010 at 1:32 PM, Anne van Kesteren ann...@opera.com wrote: On Thu, 20 May 2010 14:29:16 +0200, Bjorn Bringert bring...@google.com wrote: It should be possible to drive input type=speech with keyboard input, if the user agent chooses to implement that. Nothing in the API should require the user to actually speak. I think this is a strong argument for why input type=speech should not be replaced by a microphone API and a separate speech recognizer, since the latter would be very hard to make accessible. (I still think that there should be a microphone API for applications like audio chat, but that's a separate discussion). So why not implement speech support on top of the existing input types? Speech-driven keyboards certainly get you some of the benefits of input type=speech, but they give the application developer less control and less information than a speech-specific API. Some advantages of a dedicated speech input type: - Application-defined grammars. This is important for getting high recognition accuracy in limited domains. - Allows continuous speech recognition where the app gets events on speech endpoints. - Multiple recognition hypotheses. This lets applications implement intelligent input disambiguation. - Doesn't require the input element to have keyboard focus while speaking. - Doesn't require a visible text input field. -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
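To illustrate the "multiple recognition hypotheses" point, a page could offer the lower-ranked hypotheses as "Did you say ...?" alternatives. A minimal sketch, assuming a results array of {utterance, confidence} objects on the element; these field names and the container element are assumptions, not the proposal's exact API:
--- code sample (hypothetical field names) ---
<script type="text/javascript">
function showAlternatives(speechInput) {
  // assumed shape: best hypothesis first, alternatives after it
  var hypotheses = speechInput.results || [];
  for (var i = 1; i < hypotheses.length; i++) {
    var link = document.createElement("a");
    link.href = "#";
    link.textContent = "Did you say '" + hypotheses[i].utterance + "'?";
    // 'alternatives' is an assumed container element on the page
    document.getElementById("alternatives").appendChild(link);
  }
}
</script>
--- end of code sample ---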
Re: [whatwg] Speech input element
On Thu, 20 May 2010 14:18:56 +0100, Bjorn Bringert bring...@google.com wrote: On Thu, May 20, 2010 at 1:32 PM, Anne van Kesteren ann...@opera.com wrote: On Thu, 20 May 2010 14:29:16 +0200, Bjorn Bringert bring...@google.com wrote: It should be possible to drive input type=speech with keyboard input, if the user agent chooses to implement that. Nothing in the API should require the user to actually speak. I think this is a strong argument for why input type=speech should not be replaced by a microphone API and a separate speech recognizer, since the latter would be very hard to make accessible. (I still think that there should be a microphone API for applications like audio chat, but that's a separate discussion). So why not implement speech support on top of the existing input types? Speech-driven keyboards certainly get you some of the benefits of input type=speech, but they give the application developer less control and less information than a speech-specific API. Some advantages of a dedicated speech input type:
It's more important that users have control (e.g. over whether they want to input text by voice or by typing) than devs. Devs don't know the needs of every single user of their forms. Also, I don't see any new speech-specific
- Application-defined grammars. This is important for getting high recognition accuracy in limited domains.
This may be true, but does this require a new type? I really don't know.
- Allows continuous speech recognition where the app gets events on speech endpoints.
Please describe how exactly this is different from continuous text input.
- Doesn't require the input element to have keyboard focus while speaking.
Neither does input type=text if the user chooses to input text into it with voice. It requires microphone focus (termed "activated" in the draft). Anything else is a usability issue in the app, not in the form spec.
- Doesn't require a visible text input field.
HTML does not (or at least shouldn't) define how elements will be presented. In particular, it does not mandate a visual interface if the user doesn't want one. See also: CSS. Also, the spec clearly states that "The user can click the element to move back to the not activated state." So the draft suggests a visible input element; I assume this was an informal note and not a requirement.
From the draft on http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en:
Web search by voice
Speech translation
input type=text for client-side recognition, input type=audio for server-side.
Speech-enabled webmail client
Commandline interface with pronounceable commands (as is recommended for commandline interfaces in general anyway).
VoiceXML interpreter
I don't see how XML interpreters relate to speech-based HTML forms. Or my definition of interpreter doesn't match yours (I don't write English natively).
--- code sample from draft ---
<html>
<script type="text/javascript">
function startSearch(event) {
  var query = event.target.value;
  document.getElementById("q").value = query;
  // use AJAX search API to get results for
  // q.value and put in #search_results.
}
</script>
<body>
<form name="search_form">
  <input type="text" name="q" id="q">
  <input type="speech" grammar="builtin:search" onchange="startSearch(event)">
</form>
<div id="search_results"></div>
</body>
</html>
--- end of code sample ---
How is listening for changes on one element and moving them to another element and then submitting the form better than e.g.
--- code sample ---
<html>
<!-- tell browser that form is a search box -->
<link rel="search" href="#search">
<body>
<form id="search"> <!-- or name="search" -->
  <input type="search" name="q" id="q">
</form>
</body>
</html>
--- end of code sample ---
Works sans scripting; a scripted submit can be used if scripting is supported. I'd understand it if it linked to some SRGS stuff, but it doesn't. Also it breaks the @type attribute of input, so you'd have to add /another/ attribute to tell browsers what type of information is expected to be entered into the input. Speech isn't a type of information. It's a way to input information. Really, you should be using CSS and JavaScript if you want fine-grained control over the user interaction (for human users that'll use the form). Feel free to add speech recognition capabilities to JavaScript and improve CSS styling of voice media. If you wanted to integrate HTML forms and SRGS, that shouldn't break input type.
Re: [whatwg] Speech input element
On Tue, 18 May 2010 10:52:53 +0200, Bjorn Bringert bring...@google.com wrote: On Tue, May 18, 2010 at 8:02 AM, Anne van Kesteren ann...@opera.com wrote: I wonder how it relates to the device proposal already in the draft. In theory that supports microphone input too. It would be possible to implement speech recognition on top of a microphone input API. The most obvious approach would be to use device to get an audio stream, and send that audio stream to a server (e.g. using WebSockets). The server runs a speech recognizer and returns the results. Advantages of the speech input element: - Web app developers do not need to build and maintain a speech recognition service. - Implementations can choose to use client-side speech recognition. This could give reduced network traffic and latency (but probably also reduced recognition accuracy and language support). Implementations could also use server-side recognition by default, switching to local recognition in offline or low bandwidth situations. - Using a general audio capture API would require APIs for things like audio encoding and audio streaming. Judging from the past results of specifying media features, this may be non-trivial. The speech input element turns all audio processing concerns into implementation details. - Implementations can have special UI treatment for speech input, which may be different from that for general audio capture. I guess I don't really see why this cannot be added on top of the device element. Maybe it is indeed better though to separate the two. The reason I'm mostly asking is that one reason we went with device rather than input is that the result of the user operation is not something that will partake in form submission. Now obviously a lot of use cases today for form controls do not partake in form submission but are handled by script, but all the controls that are there can be used as part of form submission. input type=speech does not seem like it can. Advantages of using a microphone API: - Web app developers get complete control over the quality and features of the speech recognizer. This is a moot point for most developers though, since they do not have the resources to run their own speech recognition service. - Fewer features to implement in browsers (assuming that a microphone API would be added anyway). Right, and I am pretty positive we will add a microphone API. What e.g. could be done is that you have a speech recognition object of some sorts that you can feed the audio stream that comes out of device. (Or indeed you feed the stream to a server via WebSocket.) -- Anne van Kesteren http://annevankesteren.nl/
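A rough sketch of the lower-level route discussed above: capture audio with a microphone/device API and stream it to a recognition server over WebSocket. The WebSocket calls are the real API; the microphone object, its ondata callback, and the server URL are invented here purely for illustration, since no capture API had been specified at the time:
--- code sample (microphone API is hypothetical) ---
<script type="text/javascript">
// Open a socket to a (hypothetical) server-side recognizer.
var socket = new WebSocket("ws://recognizer.example.com/listen");
socket.onmessage = function (event) {
  // Assume the server sends back the recognized text.
  document.getElementById("q").value = event.data;
};
// 'microphone' stands in for whatever stream a device element would
// hand back; the ondata callback is an assumption.
microphone.ondata = function (audioChunk) {
  socket.send(audioChunk);
};
</script>
--- end of code sample ---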
Re: [whatwg] Speech input element
On Tue, 18 May 2010 11:30:01 +0200, Bjorn Bringert bring...@google.com wrote: Yes, I agree with that. The tricky issue, as Olli points out, is whether and when the 'error' event should fire when recognition is aborted because the user moves away or gets an alert. What does XMLHttpRequest do? I don't really see how the problem is the same as with synchronous XMLHttpRequest. When you do a synchronous request nothing happens to the event loop so an alert() dialog could never happen. I think you want recording to continue though. Having a simple dialog stop video conferencing for instance would be annoying. It's only script execution that needs to be paused. I'm also not sure if I'd really want recording to stop while looking at a page in a different tab. Again, if I'm in a conference call I'm almost always doing tasks on the side. E.g. looking up past discussions, scrolling through a document we're discussing, etc. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] Speech input element
I don't really see how the problem is the same as with synchronous XMLHttpRequest. When you do a synchronous request nothing happens to the event loop so an alert() dialog could never happen. I think you want recording to continue though. Having a simple dialog stop video conferencing for instance would be annoying. It's only script execution that needs to be paused. I'm also not sure if I'd really want recording to stop while looking at a page in a different tab. Again, if I'm in a conference call I'm almost always doing tasks on the side. E.g. looking up past discussions, scrolling through a document we're discussing, etc. Can you clarify how the speech input element (as described in the current API sketch) is related to video conferencing or a conference call, since it doesn't really stream audio to any place other than potentially a speech recognition server and feeds the result back to the element? -- Cheers Satish
Re: [whatwg] Speech input element
On Wed, 19 May 2010 10:22:54 +0200, Satish Sampath sat...@google.com wrote: I don't really see how the problem is the same as with synchronous XMLHttpRequest. When you do a synchronous request nothing happens to the event loop so an alert() dialog could never happen. I think you want recording to continue though. Having a simple dialog stop video conferencing for instance would be annoying. It's only script execution that needs to be paused. I'm also not sure if I'd really want recording to stop while looking at a page in a different tab. Again, if I'm in a conference call I'm almost always doing tasks on the side. E.g. looking up past discussions, scrolling through a document we're discussing, etc. Can you clarify how the speech input element (as described in the current API sketch) is related to video conferencing or a conference call, since it doesn't really stream audio to any place other than potentially a speech recognition server and feeds the result back to the element? Well, as indicated in the other thread I'm not sure whether this is the best way to do it. Usually we start with a lower-level API (i.e. microphone input) and build up from there. But maybe I'm wrong and speech input is a case that needs to be considered separately. It would still not be like synchronous XMLHttpRequest though. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] Speech input element
On Wed, May 19, 2010 at 12:50 AM, Anne van Kesteren ann...@opera.com wrote: On Tue, 18 May 2010 10:52:53 +0200, Bjorn Bringert bring...@google.com wrote: ... Advantages of the speech input element: - Web app developers do not need to build and maintain a speech recognition service. But browser authors would, and it's not clear they will do so in a cross-platform, compatible way. Client devices with limited cache memory sizes and battery power aren't very good at the Viterbi beam search algorithm, which isn't helped much by small caches because it's mostly random reads across wide memory spans. - Implementations can have special UI treatment for speech input, which may be different from that for general audio capture. I guess I don't really see why this cannot be added on top of the device element. Maybe it is indeed better though to separate the two. The reason I'm mostly asking is that one reason we went with device rather than input is that the result of the user operation is not something that will partake in form submission That's not a good reason. Audio files are uploaded with input type=file all the time, but it wasn't until Flash made it possible that browser authors started considering the possibilities of microphone upload, even though they were urged to address the issue a decade ago: From: Tim Berners-Lee ti...@w3.org Date: Fri, 31 Mar 2000 16:37:02 -0500 ... This is a question of getting browser manufacturers to implement what is already in HTML HTML 4 does already include a way of requesting audio input. For instance, you can write: INPUT name=audiofile1 type=file accept=audio/* and be prompted for various means of audio input (a recorder, a mixing desk, a file icon drag and drop receptor, etc). Here file does not mean from a disk but large body of data with a MIME type. As someone who used the NeXT machine's lip service many years ago I see no reason why browsers should not implement both audio and video and still capture in this way. There are many occasions that voice input is valuable. We have speech recognition systems in the lab, for example, and of course this is very much needed So you don't need to convince me of the usefulness. However, browser writers have not implemented this! One needs to encourage this feature to be implemented, and implemented well. I hope this helps. Tim Berners-Lee Further back in January, 2000, that same basic feature request had been endorsed by more than 150 people, including: * Michael Swaine - in his article, Sounds like... - webreview.com/pub/98/08/21/frames - mswa...@swaine.com - well-known magazine columnist for and long-time editor-in-chief of Dr. Dobb's Journal * David Turner and Keith Ross of Institut Eurecom - in their paper, Asynchronous Audio Conferencing on the Web - www.eurecom.fr/~turner/papers/aconf/abstract.html - {turner,ro...@eurecom.fr * Integrating Speech Technology in Language Learning SIG - dbs.tay.ac.uk/instil - and InSTIL's ICARE committee, both chaired by Lt. Col. Stephen LaRocca - gs0...@exmail.usma.army.mil - a language instructor at the U.S. Military Academy * Dr. 
Goh Kawai - g...@kawai.com - a researcher in the fields of computer aided language instruction and speech recognition, and InSTIL/ICARE founding member - www.kawai.com/goh * Ruth Ross - r...@earthlab.com - IEEE Learning Technologies Standards Committee - www.earthlab.com/RCR * Phil Siviter - phil.sivi...@brighton.ac.uk - IEEE LTSC - www.it.bton.ac.uk/staff/pfs/research.htm * Safia Barikzai - s.barik...@sbu.ac.uk - IEEE LTSC - www.sbu.ac.uk/barikzai * Gene Haldeman - g...@gene-haldeman.com - Computer Professionals for Social Responsibility, Ethics Working Group * Steve Teicher - steve-teic...@att.net - University of Central Florida; CPSR Education Working Group * Dr. Melissa Holland - mholl...@arl.mil - team leader for the U.S. Army Research Laboratory's Language Technology Group * Tull Jenkins - jenki...@atsc.army.mil - U.S. Army Training Support Centers However, W3C decided not to move forward with the implementation details at http://www.w3.org/TR/device-upload because they were said to be device dependent, which was completely meaningless, really. Regards, James Salsman
Re: [whatwg] Speech input element
Has anyone spent any time imagining what a microphone/video-camera API that supports the video conference use case might look like? If so, it'd be great to see a link. My guess is that it's going to be much more complicated and much more invasive security-wise. Looking at Bjorn's proposal, it seems as though it fairly elegantly supports the use cases while avoiding the need for explicit permission requests (i.e. infobars, modal dialogs, etc) since permission is implicitly granted every time it's used and permission is revoked when, for example, the window loses focus. I'd be very excited if a WG took a serious look at microphone/video-camera/etc, but I suspect that speech to text is enough of a special case (in terms of how it's often implemented in hardware and in terms of security) that it won't be possible to fold into a more general microphone/video-camera/etc API without losing ease of use, which is pretty central to the use cases listed in Bjorn's doc. J On Wed, May 19, 2010 at 9:30 AM, Anne van Kesteren ann...@opera.com wrote: On Wed, 19 May 2010 10:22:54 +0200, Satish Sampath sat...@google.com wrote: I don't really see how the problem is the same as with synchronous XMLHttpRequest. When you do a synchronous request nothing happens to the event loop so an alert() dialog could never happen. I think you want recording to continue though. Having a simple dialog stop video conferencing for instance would be annoying. It's only script execution that needs to be paused. I'm also not sure if I'd really want recording to stop while looking at a page in a different tab. Again, if I'm in a conference call I'm almost always doing tasks on the side. E.g. looking up past discussions, scrolling through a document we're discussing, etc. Can you clarify how the speech input element (as described in the current API sketch) is related to video conferencing or a conference call, since it doesn't really stream audio to any place other than potentially a speech recognition server and feeds the result back to the element? Well, as indicated in the other thread I'm not sure whether this is the best way to do it. Usually we start with a lower-level API (i.e. microphone input) and build up from there. But maybe I'm wrong and speech input is a case that needs to be considered separately. It would still not be like synchronous XMLHttpRequest though. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] Speech input element
I am a little concerned that we are increasingly breaking down a metaphor, a 'virtual interface' without realizing what that abstraction buys us. At the moment, we have the concept of a hypothetical pointer and hypothetical keyboard, (with some abstract states, such as focus) that you can actually drive using a whole bunch of physical modalities. If we develop UIs that are specific to people actually speaking, we have 'torn the veil' of that abstract interface. What happens to people who cannot speak, for example? Or who cannot say the language needed well enough to be recognized? David Singer Multimedia and Software Standards, Apple Inc.
Re: [whatwg] Speech input element
On Thu, May 20, 2010 at 12:38 AM, David Singer sin...@apple.com wrote: I am a little concerned that we are increasingly breaking down a metaphor, a 'virtual interface' without realizing what that abstraction buys us. I'm more than a little concerned about this and hope that we tread much more carefully than it seems some parties are willing to do. I'm glad I'm not alone. At the moment, we have the concept of a hypothetical pointer and hypothetical keyboard, (with some abstract states, such as focus) that you can actually drive using a whole bunch of physical modalities. If we develop UIs that are specific to people actually speaking, we have 'torn the veil' of that abstract interface. What happens to people who cannot speak, for example? Or who cannot say the language needed well enough to be recognized?
Re: [whatwg] Speech input element
On Mon, 17 May 2010 15:05:22 +0200, Bjorn Bringert bring...@google.com wrote: Back in December there was a discussion about web APIs for speech recognition and synthesis that saw a decent amount of interest (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281). Based on that discussion, we would like to propose a simple API for speech recognition, using a new input type=speech element. An informal spec of the new API, along with some sample apps and use cases can be found at: http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en. It would be very helpful if you could take a look and share your comments. Our next steps will be to implement the current design, get some feedback from web developers, continue to tweak, and seek standardization as soon as it looks mature enough and/or other vendors become interested in implementing it. I wonder how it relates to the device proposal already in the draft. In theory that supports microphone input too. -- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] Speech input element
On Mon, May 17, 2010 at 9:23 PM, Olli Pettay olli.pet...@helsinki.fi wrote: On 5/17/10 6:55 PM, Bjorn Bringert wrote: (Looks like half of the first question is missing, so I'm guessing here) If you are asking about when the web app loses focus (e.g. the user switches to a different tab or away from the browser), I think the recognition should be cancelled. I've added this to the spec. Oh, where did the rest of the question go. I was going to ask about alert()s. What happens if alert() pops up while recognition is on? Which events should fire and when? Hmm, good question. I think that either the recognition should be cancelled, like when the web app loses focus, or it should continue just as if there was no alert. Are there any browser implementation reasons to do one or the other? The grammar specifies the set of utterances that the speech recognizer should match against. The grammar may be annotated with SISR, which will be used to populate the 'interpretation' field in ListenResult. I know what grammars are :) Yeah, sorry about my silly reply there, I just wasn't sure exactly what you were asking. What I meant that it is not very well specified that the result is actually put to .value etc. Yes, good point. The alternatives would be to use either the 'utterance' or the 'interpretation' value from the most likely recognition result. If the grammar does not contain semantics, those are identical, so it doesn't matter in that case. If the developer has added semantics to the grammar, the interpretation is probably more interesting than the utterance. So my conclusion is that it would make most sense to store the interpretation in @value. I've updated the spec with better definitions of @value and @results. And still, I'm still not quite sure what builtin:search actually is. What kind of grammar would that be? How is that different from builtin:dictation? To be useful, those should probably be large statistical language models (e.g. n-gram models) trained on different corpora. So builtin:dictation might be trained on a corpus containing e-mails, SMS messages and news text, and builtin:search might be trained on query strings from a search engine. I've updated the spec to make builtin:search optional, mapping to builtin:dictation if not implemented. The exact language matched by these models would be implementation dependent, and implementations may choose to be clever about them. For example by: - Dynamic tweaking for different web apps based on the user's previous inputs and the text contained in the web app. - Adding the names of all contacts from the user's address book to the dictation model. - Weighting place names based on geographic proximity (in an implementation that has access to the user's location). -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
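To make the utterance/interpretation distinction concrete, a small SRGS grammar with SISR tags of the kind discussed above might look like the sketch below (an illustration, not taken from the proposal). With it, the utterance "read next message" yields the interpretation "next", which per the updated spec would be stored in @value.
--- code sample (illustrative SRGS/SISR grammar) ---
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="command" tag-format="semantics/1.0">
  <rule id="command" scope="public">
    <one-of>
      <item>archive <tag>out = "archive";</tag></item>
      <item>reply <tag>out = "reply";</tag></item>
      <item>read next message <tag>out = "next";</tag></item>
    </one-of>
  </rule>
</grammar>
--- end of code sample ---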
Re: [whatwg] Speech input element
On Tue, May 18, 2010 at 8:02 AM, Anne van Kesteren ann...@opera.com wrote: On Mon, 17 May 2010 15:05:22 +0200, Bjorn Bringert bring...@google.com wrote: Back in December there was a discussion about web APIs for speech recognition and synthesis that saw a decent amount of interest (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281). Based on that discussion, we would like to propose a simple API for speech recognition, using a new input type=speech element. An informal spec of the new API, along with some sample apps and use cases can be found at: http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en. It would be very helpful if you could take a look and share your comments. Our next steps will be to implement the current design, get some feedback from web developers, continue to tweak, and seek standardization as soon it looks mature enough and/or other vendors become interested in implementing it. I wonder how it relates to the device proposal already in the draft. In theory that supports microphone input too. It would be possible to implement speech recognition on top of a microphone input API. The most obvious approach would be to use device to get an audio stream, and send that audio stream to a server (e.g. using WebSockets). The server runs a speech recognizer and returns the results. Advantages of the speech input element: - Web app developers do not need to build and maintain a speech recognition service. - Implementations can choose to use client-side speech recognition. This could give reduced network traffic and latency (but probably also reduced recognition accuracy and language support). Implementations could also use server-side recognition by default, switching to local recognition in offline or low bandwidth situations. - Using a general audio capture API would require APIs for things like audio encoding and audio streaming. Judging from the past results of specifying media features, this may be non-trivial. The speech input element turns all audio processing concerns into implementation details. - Implementations can have special UI treatment for speech input, which may be different from that for general audio capture. Advantages of using a microphone API: - Web app developers get complete control over the quality and features of the speech recognizer. This is a moot point for most developers though, since they do not have the resources to run their own speech recognition service. - Fewer features to implement in browsers (assuming that a microphone API would be added anyway). -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
Re: [whatwg] Speech input element
On 5/18/10 11:27 AM, Bjorn Bringert wrote: On Mon, May 17, 2010 at 9:23 PM, Olli Pettayolli.pet...@helsinki.fi wrote: On 5/17/10 6:55 PM, Bjorn Bringert wrote: (Looks like half of the first question is missing, so I'm guessing here) If you are asking about when the web app loses focus (e.g. the user switches to a different tab or away from the browser), I think the recognition should be cancelled. I've added this to the spec. Oh, where did the rest of the question go. I was going to ask about alert()s. What happens if alert() pops up while recognition is on? Which events should fire and when? Hmm, good question. I think that either the recognition should be cancelled, like when the web app loses focus, or it should continue just as if there was no alert. Are there any browser implementation reasons to do one or the other? Well, the problem with alert is that the assumption (which may or may not always hold) is that when alert() is opened, web page shouldn't run any scripts. So should input type=speech fire some events when the recognition is canceled (if alert cancels recognition), and if yes, when? Or if recognition is not canceled, and something is recognized (so input event should be dispatched), when should the event actually fire? The problem is pretty much the same with synchronous XMLHttpRequest. -Olli
Re: [whatwg] Speech input element
Well, the problem with alert is that the assumption (which may or may not always hold) is that when alert() is opened, web page shouldn't run any scripts. So should input type=speech fire some events when the recognition is canceled (if alert cancels recognition), and if yes, when? Or if recognition is not canceled, and something is recognized (so input event should be dispatched), when should the event actually fire? The problem is pretty much the same with synchronous XMLHttpRequest. In my opinion, once the speech input element has started recording, any event which takes the user's focus away from actually speaking should ideally stop the speech recognition. This would include switching to a new window, a new tab or modal/alert dialogs, submitting a form or navigating to a new page in the same tab/window. -- Cheers Satish
Re: [whatwg] Speech input element
Hi Bjorn, Thank you for bringing this topic (again :) to the WHATWG list. I'd like to bring this to the W3C Voice Browser Working Group (and maybe the Multimodal Interaction Working Group as well) and ask the group participants for their opinions. As you might know, the group recently created a task force named Voice on the Web and is working hard to promote voice technology in various possible Web applications. Regards, Kazuyuki Bjorn Bringert wrote: Back in December there was a discussion about web APIs for speech recognition and synthesis that saw a decent amount of interest (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281). Based on that discussion, we would like to propose a simple API for speech recognition, using a new input type=speech element. An informal spec of the new API, along with some sample apps and use cases can be found at: http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en. It would be very helpful if you could take a look and share your comments. Our next steps will be to implement the current design, get some feedback from web developers, continue to tweak, and seek standardization as soon as it looks mature enough and/or other vendors become interested in implementing it. -- Kazuyuki Ashimura / W3C Multimodal Voice Activity Lead mailto: ashim...@w3.org voice: +81.466.49.1170 / fax: +81.466.49.1171
Re: [whatwg] Speech input element
Hi Bjorn and James, Just FYI, W3C is organizing a workshop on Conversational Applications. The main goal of the workshop is collecting use cases and requirements for new models of human language to support mobile conversational systems. The workshop will be held on June 18-19 in Somerset, NJ, US. The detailed call for participation is available at: http://www.w3.org/2010/02/convapps/cfp.html I think there may be some discussion during the workshop about a possible multimodal e-learning system as a use case. Is either of you by chance interested in the workshop? Regards, Kazuyuki Bjorn Bringert wrote: On Mon, May 17, 2010 at 10:55 PM, James Salsman jsals...@gmail.com wrote: On Mon, May 17, 2010 at 8:55 AM, Bjorn Bringert bring...@google.com wrote: - What exactly are grammars builtin:dictation and builtin:search? They are intended to be implementation-dependent large language models, for dictation (e.g. e-mail writing) and search queries respectively. I've tried to clarify them a bit in the spec now. There should perhaps be more of these (e.g. builtin:address), maybe with some optional, mapping to builtin:dictation if not available. Bjorn, are you interested in including speech recognition support for pronunciation assessment such as is done by http://englishcentral.com/ , http://www.scilearn.com/products/reading-assistant/ , http://www.eyespeakenglish.com/ , and http://wizworldonline.com/ , http://www.8dworld.com/en/home.html ? Those would require different sorts of language models and grammars such as those described in http://www.springerlink.com/content/l0385t6v425j65h7/ Please let me know your thoughts. I don't have SpringerLink access, so I couldn't read that article. As far as I could tell from the abstract, they use phoneme-level speech recognition and then calculate the edit distance to the correct phoneme sequences. Do you have a concrete proposal for how this could be supported? Would support for PLS (http://www.w3.org/TR/pronunciation-lexicon/) links in SRGS be enough (the SRGS spec already includes that)? -- Kazuyuki Ashimura / W3C Multimodal Voice Activity Lead mailto: ashim...@w3.org voice: +81.466.49.1170 / fax: +81.466.49.1171
Re: [whatwg] Speech input element
Hi, Bjorn- Bjorn Bringert wrote (on 5/17/10 9:05 AM): Back in December there was a discussion about web APIs for speech recognition and synthesis that saw a decent amount of interest (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281). Based on that discussion, we would like to propose a simple API for speech recognition, using a new input type=speech element. An informal spec of the new API, along with some sample apps and use cases can be found at: http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en. It would be very helpful if you could take a look and share your comments. Our next steps will be to implement the current design, get some feedback from web developers, continue to tweak, and seek standardization as soon as it looks mature enough and/or other vendors become interested in implementing it. This is important work, thanks for taking it on and bringing it to a wider discussion forum. Here are a couple of other venues where you might also consider discussing it, above and beyond discussion on the WHATWG list:
* W3C just launched a new Audio Incubator Group (Audio XG), as a forum to discuss various aspects of audio on the Web. The Audio XG is not intended to produce Recommendation-track specifications like this (though they will likely prototype and write a draft spec for a read-write audio API), but it could serve a role in helping work out use cases and requirements, reviewing specs, and so forth. I'm not totally sure that this is relevant to your interests, but I thought I would bring it up.
* The Voice Browser Working Group is very interested in bringing their work and experience into the graphical browser world, so you should work with them or get their input. As I understand it, some of them plan to join the Audio XG, too (specifically to talk about speech synthesis in the larger context), so that might be one forum to have some conversations. VoiceXML is rather different than X/HTML or the browser DOM, and the participants in the VBWG don't necessarily have the right experience in graphical browser approaches, so I think there's an opportunity for good conversation and cross-pollination here.
[1] http://www.w3.org/2005/Incubator/audio/ [2] http://www.w3.org/Voice/ Regards- -Doug
Re: [whatwg] Speech input element
On 5/17/10 4:05 PM, Bjorn Bringert wrote: Back in December there was a discussion about web APIs for speech recognition and synthesis that saw a decent amount of interest (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281). Based on that discussion, we would like to propose a simple API for speech recognition, using a new input type=speech element. An informal spec of the new API, along with some sample apps and use cases can be found at: http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en. It would be very helpful if you could take a look and share your comments. Our next steps will be to implement the current design, get some feedback from web developers, continue to tweak, and seek standardization as soon as it looks mature enough and/or other vendors become interested in implementing it. After a quick read I, in general, like the proposal. A few comments though. - What should happen if for example What happens to the events which are fired during that time? Or should recognition stop? - What exactly are grammars builtin:dictation and builtin:search? Especially the latter one is not at all clear to me. - When does recognitionState change? Before which events? - It is not quite clear how SRGS works with input type=speech - I believe there is no need for DOMImplementation.hasFeature(SpeechInput, 1.0) And I think we really need to define something for TTS. Not every web developer has servers for text-to-audio. -Olli
Re: [whatwg] Speech input element
On Mon, May 17, 2010 at 3:00 PM, Olli Pettay olli.pet...@helsinki.fi wrote: On 5/17/10 4:05 PM, Bjorn Bringert wrote: Back in December there was a discussion about web APIs for speech recognition and synthesis that saw a decent amount of interest (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281). Based on that discussion, we would like to propose a simple API for speech recognition, using a new input type=speech element. An informal spec of the new API, along with some sample apps and use cases can be found at: http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en. It would be very helpful if you could take a look and share your comments. Our next steps will be to implement the current design, get some feedback from web developers, continue to tweak, and seek standardization as soon as it looks mature enough and/or other vendors become interested in implementing it. After a quick read I, in general, like the proposal. It's pretty underspecified still, as you can see. Thanks for pointing out some missing pieces. A few comments though. - What should happen if for example What happens to the events which are fired during that time? Or should recognition stop? (Looks like half of the first question is missing, so I'm guessing here) If you are asking about when the web app loses focus (e.g. the user switches to a different tab or away from the browser), I think the recognition should be cancelled. I've added this to the spec. - What exactly are grammars builtin:dictation and builtin:search? Especially the latter one is not at all clear to me. They are intended to be implementation-dependent large language models, for dictation (e.g. e-mail writing) and search queries respectively. I've tried to clarify them a bit in the spec now. There should perhaps be more of these (e.g. builtin:address), maybe with some optional, mapping to builtin:dictation if not available. - When does recognitionState change? Before which events? Thanks, that was very underspecified. I've added a diagram to clarify it. - It is not quite clear how SRGS works with input type=speech The grammar specifies the set of utterances that the speech recognizer should match against. The grammar may be annotated with SISR, which will be used to populate the 'interpretation' field in ListenResult. Since grammars may be protected by cookies etc that are only available in the browsing session, I think the user agent will have to fetch the grammar and then pass it to the speech recognizer, rather than the recognizer accessing it directly. I'm not sure if any of that answers your question though. - I believe there is no need for DOMImplementation.hasFeature(SpeechInput, 1.0) The intention was that apps could use this to conditionally enable features that require speech input support. Is there some other mechanism that should be used instead? And I think we really need to define something for TTS. Not every web developer has servers for text-to-audio. Yes, I agree. We intend to work on that next, but didn't include it in this proposal since they are pretty separate features from the browser point of view. -- Bjorn Bringert Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
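A sketch of the conditional enabling Bjorn describes, using the feature name and version given in the thread; whether hasFeature is the right mechanism is exactly the open question here, and the element id is an assumption:
--- code sample ---
<script type="text/javascript">
if (document.implementation.hasFeature("SpeechInput", "1.0")) {
  // expose the speech-driven UI ('speech_ui' is an assumed element id)
  document.getElementById("speech_ui").style.display = "";
} else {
  // fall back to plain typed input only
  document.getElementById("speech_ui").style.display = "none";
}
</script>
--- end of code sample ---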
Re: [whatwg] Speech input element
On 5/17/10 6:55 PM, Bjorn Bringert wrote: (Looks like half of the first question is missing, so I'm guessing here) If you are asking about when the web app loses focus (e.g. the user switches to a different tab or away from the browser), I think the recognition should be cancelled. I've added this to the spec. Oh, where did the rest of the question go. I was going to ask about alert()s. What happens if alert() pops up while recognition is on? Which events should fire and when? The grammar specifies the set of utterances that the speech recognizer should match against. The grammar may be annotated with SISR, which will be used to populate the 'interpretation' field in ListenResult. I know what grammars are :) What I meant that it is not very well specified that the result is actually put to .value etc. And still, I'm still not quite sure what builtin:search actually is. What kind of grammar would that be? How is that different from builtin:dictation? -Olli
Re: [whatwg] Speech input element
On Mon, May 17, 2010 at 8:55 AM, Bjorn Bringert bring...@google.com wrote: - What exactly are grammars builtin:dictation and builtin:search? They are intended to be implementation-dependent large language models, for dictation (e.g. e-mail writing) and search queries respectively. I've tried to clarify them a bit in the spec now. There should perhaps be more of these (e.g. builtin:address), maybe with some optional, mapping to builtin:dictation if not available. Bjorn, are you interested in including speech recognition support for pronunciation assessment such as is done by http://englishcentral.com/ , http://www.scilearn.com/products/reading-assistant/ , http://www.eyespeakenglish.com/ , and http://wizworldonline.com/ , http://www.8dworld.com/en/home.html ? Those would require different sorts of language models and grammars such as those described in http://www.springerlink.com/content/l0385t6v425j65h7/ Please let me know your thoughts. Best regards, James Salsman - http://talknicer.com/