Re: [whatwg] Speech input element

2010-06-16 Thread Anne van Kesteren
On Tue, 15 Jun 2010 17:08:40 +0200, Satish Sampath sat...@google.com  
wrote:

To add a little more clarity - we initially proposed a speech input API
using a new input type=speech element. The top feedback we received  
was to extend speech as a form of input to existing elements instead of  
creating a new speech control. We have taken that into account in the  
new proposal which extends speech input to existing form elements and other editable
elements. Please take a fresh look and share your thoughts.


Could you maybe post a link to the proposal? Or in case you intended to  
attach it: it didn't get through.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Speech input element

2010-06-16 Thread Satish Sampath
Please see http://docs.google.com/View?id=dcfg79pz_5dhnp23f5 for the
new proposal (Bjorn's earlier post had this link).

Cheers
Satish


Re: [whatwg] Speech input element

2010-06-15 Thread Satish Sampath
To add a little more clarity - we initially proposed a speech input API
using a new input type=speech element. The top feedback we received was
to extend speech as a form of input to existing elements instead of creating
a new speech control. We have taken that into account in the new proposal
which extends speech input to existing form elements and other editable
elements. Please take a fresh look and share your thoughts.

--
Cheers
Satish


Re: [whatwg] Speech input element

2010-06-15 Thread Bjartur Thorlacius
From TFA:
 We would like some way of having speech control in a web application,
 without any input fields. For example, in a webmail client, there are
 buttons, links etc. that let the user take actions such as deleting or
 replying to email. We would like to make it easy to implement a speech
 interface where the user can say "read next message", "archive",
 "reply" etc., without having to show a text field where the same
 commands can be typed.
<link rel="next" title="Read next message">
<form action="archive" method="POST" title="Archive message"> <!-- @method="MOVE"? -->
  <button type="submit">Archive</button>
</form>
--
kv,
  - Bjartur


Re: [whatwg] Speech input element

2010-06-15 Thread Bjartur Thorlacius
---From TFA---
A web search application can accept speech input, and perform a search
immediately when the input is recognized. If it has access to the
additional recognition hypotheses (aka the N-best list), it can display
them on the search results page and let the user choose the correct
query if the input was misrecognized. For example, Google search might
display search results for "recognize speech", and show a link with
the text "Did you say 'wreck a nice beach'?".
---  ---
User agents can submit any GET form immediately. They may also keep
the form open for editing and list correction suggestions alongside it as
the form gets submitted and results are shown to the user; the suggestions
shouldn't get mixed up with the results.
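
A minimal sketch of the kind of form this describes, assuming a plain GET
search endpoint; the /search action URL is a placeholder and no
speech-specific markup is used:

--- sketch ---
<!-- Plain GET search form: a speech-capable user agent can fill in #q with
     the recognized text and submit right away, while keeping the field
     editable so the user can pick a correction afterwards. -->
<form action="/search" method="GET">
  <input type="search" name="q" id="q">
  <button type="submit">Search</button>
</form>
--- end of sketch ---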
--
kv,
  - Bjartur


Re: [whatwg] Speech input element

2010-06-14 Thread Bjorn Bringert
Based on the feedback in this thread we've worked out a new speech
input proposal that adds a @speech attribute to most input elements,
instead of a new input type=speech. Please see
http://docs.google.com/View?id=dcfg79pz_5dhnp23f5 for the new
proposal.

/Bjorn Bringert  Satish Sampath

-- 
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-20 Thread Bjorn Bringert
On Thu, May 20, 2010 at 1:32 PM, Anne van Kesteren ann...@opera.com wrote:
 On Thu, 20 May 2010 14:29:16 +0200, Bjorn Bringert bring...@google.com
 wrote:

 It should be possible to drive input type=speech with keyboard
 input, if the user agent chooses to implement that. Nothing in the API
 should require the user to actually speak. I think this is a strong
 argument for why input type=speech should not be replaced by a
 microphone API and a separate speech recognizer, since the latter
 would be very hard to make accessible. (I still think that there
 should be a microphone API for applications like audio chat, but
 that's a separate discussion).

 So why not implement speech support on top of the existing input types?

Speech-driven keyboards certainly get you some of the benefits of
input type=speech, but they give the application developer less
control and less information than a speech-specific API. Some
advantages of a dedicated speech input type:

- Application-defined grammars. This is important for getting high
recognition accuracy in limited domains.

- Allows continuous speech recognition where the app gets events on
speech endpoints.

- Multiple recognition hypotheses. This lets applications implement
intelligent input disambiguation (see the sketch after this list).

- Doesn't require the input element to have keyboard focus while speaking.

- Doesn't require a visible text input field.
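
A rough sketch of the disambiguation point above. The 'results' array and its
'utterance' field follow the ListenResult names discussed later in this
thread; the exact API shape, the grammar value and the list markup are
assumptions, not settled spec:

--- sketch ---
<input type="speech" grammar="builtin:search" onchange="showHypotheses(event)">
<ul id="alternatives"></ul>

<script type="text/javascript">
// Offer the lower-ranked recognition hypotheses as "did you say" prompts.
// 'results' and 'utterance' are assumed names, not a finalized API.
function showHypotheses(event) {
  var list = document.getElementById("alternatives");
  list.innerHTML = "";
  var results = event.target.results || [];
  for (var i = 1; i < results.length; i++) {  // index 0 is the top hypothesis
    var item = document.createElement("li");
    item.appendChild(document.createTextNode(
        "Did you say '" + results[i].utterance + "'?"));
    list.appendChild(item);
  }
}
</script>
--- end of sketch ---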

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-20 Thread bjartur
 
On Thu, 20 May 2010 14:18:56 +0100, Bjorn Bringert bring...@google.com wrote:
 On Thu, May 20, 2010 at 1:32 PM, Anne van Kesteren ann...@opera.com wrote:
  On Thu, 20 May 2010 14:29:16 +0200, Bjorn Bringert bring...@google.com
 
  It should be possible to drive input type=speech with keyboard
  input, if the user agent chooses to implement that. Nothing in the API
  should require the user to actually speak. I think this is a strong
  argument for why input type=speech should not be replaced by a
  microphone API and a separate speech recognizer, since the latter
  would be very hard to make accessible. (I still think that there
  should be a microphone API for applications like audio chat, but
  that's a separate discussion).
 
  So why not implement speech support on top of the existing input types?

  Speech-driven keyboards certainly get you some of the benefits of
 input type=speech, but they give the application developer less
 control and less information than a speech-specific API. Some
 advantages of a dedicated speech input type:
It's more important that users have control (e.g. over whether they want
to input text by voice or by typing) than that developers do. Developers
don't know the needs of every single user of their forms.

Also, I don't see any new speech-specific 
 - Application-defined grammars. This is important for getting high
 recognition accuracy in limited domains.
This may be true, but does this require a new type? I really don't know.
 - Allows continuous speech recognition where the app gets events on
 speech endpoints.
Please describe how exactly this is different from continuous text input.
 - Doesn't require the input element to have keyboard focus while speaking.
Neither does input type=text if the user chooses to input text into it
with voice. It requires microphone focus (termed "activated" in the draft).
Anything else is a usability issue in the app, not in the form spec.
 - Doesn't require a visible text input field.
HTML does not (or at least shouldn't) define how elements will be presented.
In particular, it does not mandate a visual interface if the user doesn't want
one. See also: CSS.

Also, the spec clearly states that "The user can click the element to move
back to the not activated state." So the draft suggests a visible input
element; I assume this was an informal note and not a requirement.

From the draft on 
http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en:
 Web search by voice
 Speech translation
input type=text for client-side recognition, input type=audio for 
server-side.
 Speech-enabled webmail client
A command-line interface with pronounceable commands (as is recommended for
command-line interfaces in general anyway).
 VoiceXML interpreter
I don't see how XML interpreters relate to speech-based HTML forms.
Or maybe my definition of "interpreter" doesn't match yours (I don't write
English natively).

--- code sample from draft ---
<html>
<script type="text/javascript">
function startSearch(event) {
  var query = event.target.value;
  document.getElementById("q").value = query;
  // use AJAX search API to get results for
  // q.value and put in #search_results.
}
</script>
<body>

<form name="search_form">
<input type="text" name="q" id="q">
<input type="speech" grammar="builtin:search" onchange="startSearch(event)">
</form>

<div id="search_results"></div>

</body>
</html>
--- end of code sample ---
How is listening for changes on one element, copying the value to another
element and then submitting the form better than e.g.
--- code sample ---
<html>
<!-- tell browser that the form is a search box -->
<link rel="search" href="#search">
<body>
<form id="search"> <!-- or name="search" -->
<input type="search" name="q" id="q">
</form>
</body>
</html>
--- end of code sample ---
This works without scripting; a scripted submit can be used if scripting is
supported (see the sketch below). I'd understand it if it linked to some SRGS
stuff, but it doesn't. Also, it breaks the @type attribute of input, so you
have to add /another/ attribute to tell browsers what type of information is
expected in the input. Speech isn't a type of information; it's a way to
input information.
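
A minimal sketch of the scripted submit mentioned above, using only standard
DOM APIs against the ids from the sample; it assumes the script is placed
after the form in the document:

--- sketch ---
<script type="text/javascript">
// Submit the search form as soon as #q changes, however the text got
// there (typed, pasted, or dictated through the user agent).
document.getElementById("q").addEventListener("change", function () {
  document.getElementById("search").submit();
}, false);
</script>
--- end of sketch ---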

Really, you should be using CSS and JavaScript if you want
fine-grained control over the user interaction (for the human users that'll
use the form). Feel free to add speech recognition capabilities to
JavaScript and improve CSS styling of voice media.

If you wanted to integrate HTML forms and SRGS, that shouldn't break the
@type attribute of input.


Re: [whatwg] Speech input element

2010-05-19 Thread Anne van Kesteren
On Tue, 18 May 2010 10:52:53 +0200, Bjorn Bringert bring...@google.com  
wrote:
On Tue, May 18, 2010 at 8:02 AM, Anne van Kesteren ann...@opera.com  
wrote:
I wonder how it relates to the device proposal already in the draft.  
In theory that supports microphone input too.


It would be possible to implement speech recognition on top of a
microphone input API. The most obvious approach would be to use
device to get an audio stream, and send that audio stream to a
server (e.g. using WebSockets). The server runs a speech recognizer
and returns the results.

Advantages of the speech input element:

- Web app developers do not need to build and maintain a speech
recognition service.

- Implementations can choose to use client-side speech recognition.
This could give reduced network traffic and latency (but probably also
reduced recognition accuracy and language support). Implementations
could also use server-side recognition by default, switching to local
recognition in offline or low bandwidth situations.

- Using a general audio capture API would require APIs for things like
audio encoding and audio streaming. Judging from the past results of
specifying media features, this may be non-trivial. The speech input
element turns all audio processing concerns into implementation
details.

- Implementations can have special UI treatment for speech input,
which may be different from that for general audio capture.


I guess I don't really see why this cannot be added on top of the device  
element. Maybe it is indeed better though to separate the two. The reason
I'm mostly asking is that one reason we went with device rather than  
input is that the result of the user operation is not something that  
will partake in form submission. Now obviously a lot of use cases today  
for form controls do not partake in form submission but are handled by  
script, but all the controls that are there can be used as part of form  
submission. input type=speech does not seem like it can.




Advantages of using a microphone API:

- Web app developers get complete control over the quality and
features of the speech recognizer. This is a moot point for most
developers though, since they do not have the resources to run their
own speech recognition service.

- Fewer features to implement in browsers (assuming that a microphone
API would be added anyway).


Right, and I am pretty positive we will add a microphone API. What e.g.  
could be done is that you have a speech recognition object of some sort
that you can feed the audio stream that comes out of device. (Or indeed  
you feed the stream to a server via WebSocket.)



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Speech input element

2010-05-19 Thread Anne van Kesteren
On Tue, 18 May 2010 11:30:01 +0200, Bjorn Bringert bring...@google.com  
wrote:

Yes, I agree with that. The tricky issue, as Olli points out, is
whether and when the 'error' event should fire when recognition is
aborted because the user moves away or gets an alert. What does
XMLHttpRequest do?


I don't really see how the problem is the same as with synchronous  
XMLHttpRequest. When you do a synchronous request nothing happens to the  
event loop so an alert() dialog could never happen. I think you want  
recording to continue though. Having a simple dialog stop video  
conferencing for instance would be annoying. It's only script execution  
that needs to be paused. I'm also not sure if I'd really want recording to  
stop while looking at a page in a different tab. Again, if I'm in a  
conference call I'm almost always doing tasks on the side. E.g. looking up  
past discussions, scrolling through a document we're discussing, etc.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Speech input element

2010-05-19 Thread Satish Sampath

 I don't really see how the problem is the same as with synchronous
 XMLHttpRequest. When you do a synchronous request nothing happens to the
 event loop so an alert() dialog could never happen. I think you want
 recording to continue though. Having a simple dialog stop video conferencing
 for instance would be annoying. It's only script execution that needs to be
 paused. I'm also not sure if I'd really want recording to stop while looking
 at a page in a different tab. Again, if I'm in a conference call I'm almost
 always doing tasks on the side. E.g. looking up past discussions, scrolling
 through a document we're discussing, etc.


Can you clarify how the speech input element (as described in the current
API sketch) is related to video conferencing or a conference call, since it
doesn't really stream audio to any place other than potentially a speech
recognition server and feeds the result back to the element?

--
Cheers
Satish


Re: [whatwg] Speech input element

2010-05-19 Thread Anne van Kesteren
On Wed, 19 May 2010 10:22:54 +0200, Satish Sampath sat...@google.com  
wrote:

I don't really see how the problem is the same as with synchronous
XMLHttpRequest. When you do a synchronous request nothing happens to the
event loop so an alert() dialog could never happen. I think you want
recording to continue though. Having a simple dialog stop video  
conferencing
for instance would be annoying. It's only script execution that needs  
to be paused. I'm also not sure if I'd really want recording to stop  
while looking at a page in a different tab. Again, if I'm in a  
conference call I'm almost always doing tasks on the side. E.g. looking  
up past discussions, scrolling through a document we're discussing, etc.


Can you clarify how the speech input element (as described in the current
API sketch) is related to video conferencing or a conference call, since  
it doesn't really stream audio to any place other than potentially a
speech recognition server and feeds the result back to the element?


Well, as indicated in the other thread I'm not sure whether this is the  
best way to do it. Usually we start with a lower-level API (i.e.  
microphone input) and build up from there. But maybe I'm wrong and speech  
input is a case that needs to be considered separately. It would still not  
be like synchronous XMLHttpRequest though.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Speech input element

2010-05-19 Thread James Salsman
On Wed, May 19, 2010 at 12:50 AM, Anne van Kesteren ann...@opera.com wrote:
 On Tue, 18 May 2010 10:52:53 +0200, Bjorn Bringert bring...@google.com 
 wrote:
...
 Advantages of the speech input element:

 - Web app developers do not need to build and maintain a speech
 recognition service.

But browser authors would, and it's not clear they will do so in a
cross-platform, compatible way.  Client devices with limited cache
memory sizes and battery power aren't very good at the Viterbi beam
search algorithm, which isn't helped much by small caches because it's
mostly random reads across wide memory spans.

 - Implementations can have special UI treatment for speech input,
 which may be different from that for general audio capture.

 I guess I don't really see why this cannot be added on top of the device
 element. Maybe it is indeed better though to separate the two. The reason
 I'm mostly asking is that one reason we went with device rather than
 input is that the result of the user operation is not something that will
 partake in form submission

That's not a good reason.  Audio files are uploaded with input
type=file all the time, but it wasn't until Flash made it possible
that browser authors started considering the possibilities of
microphone upload, even though they were urged to address the issue a
decade ago:

 From: Tim Berners-Lee ti...@w3.org
 Date: Fri, 31 Mar 2000 16:37:02 -0500
...
 This is a question of getting browser manufacturers to
 implement what is already in HTML. HTML 4 does already
 include a way of requesting audio input.  For instance,
 you can write:

 <INPUT name="audiofile1" type="file" accept="audio/*">

 and be prompted for various means of audio input (a recorder,
 a mixing desk, a file icon drag and drop receptor, etc).
 Here "file" does not mean "from a disk" but "large body of
 data with a MIME type".

 As someone who used the NeXT machine's lip service many
 years ago I see no reason why browsers should not implement
 both audio and video and still capture in this way.   There
 are many occasions that voice input is valuable. We have speech
 recognition systems in the lab, for example, and of course this
 is very much needed  So you don't need to convince me of
 the usefulness.

 However, browser writers have not implemented this!

 One needs to encourage this feature to be implemented, and
 implemented well.

 I hope this helps.

 Tim Berners-Lee

Further back in January, 2000, that same basic feature request had
been endorsed by more than 150 people, including:

* Michael Swaine - in his article, Sounds like... -
webreview.com/pub/98/08/21/frames  - mswa...@swaine.com - well-known
magazine columnist for and long-time editor-in-chief of Dr. Dobb's
Journal
* David Turner and Keith Ross of Institut Eurecom - in their
paper, Asynchronous Audio Conferencing on the Web -
www.eurecom.fr/~turner/papers/aconf/abstract.html -
{turner,ro...@eurecom.fr
* Integrating Speech Technology in Language Learning SIG -
dbs.tay.ac.uk/instil - and InSTIL's ICARE committee, both chaired by
Lt. Col. Stephen LaRocca - gs0...@exmail.usma.army.mil - a language
instructor at the U.S. Military Academy
* Dr. Goh Kawai - g...@kawai.com - a researcher in the fields of
computer aided language instruction and speech recognition, and
InSTIL/ICARE founding member - www.kawai.com/goh
* Ruth Ross - r...@earthlab.com - IEEE Learning Technologies
Standards Committee - www.earthlab.com/RCR
* Phil Siviter - phil.sivi...@brighton.ac.uk - IEEE LTSC -
www.it.bton.ac.uk/staff/pfs/research.htm
* Safia Barikzai - s.barik...@sbu.ac.uk - IEEE LTSC - www.sbu.ac.uk/barikzai
* Gene Haldeman - g...@gene-haldeman.com - Computer Professionals
for Social Responsibility, Ethics Working Group
* Steve Teicher - steve-teic...@att.net - University of Central
Florida; CPSR Education Working Group
* Dr. Melissa Holland - mholl...@arl.mil - team leader for the
U.S. Army Research Laboratory's Language Technology Group
* Tull Jenkins - jenki...@atsc.army.mil - U.S. Army Training
Support Centers

However, W3C decided not to move forward with the implementation
details at http://www.w3.org/TR/device-upload because they were said
to be device dependent, which was completely meaningless, really.

Regards,
James Salsman


Re: [whatwg] Speech input element

2010-05-19 Thread Jeremy Orlow
Has anyone spent any time imagining what a microphone/video-camera API that
supports the video conference use case might look like?  If so, it'd be
great to see a link.

My guess is that it's going to be much more complicated and much more
invasive security wise.  Looking at Bjorn's proposal, it seems as though it
fairly elegantly supports the use cases while avoiding the need for explicit
permission requests (e.g. infobars, modal dialogs, etc.) since permission is
implicitly granted every time it's used and permission is revoked when, for
example, the window loses focus.

I'd be very excited if a WG took a serious look at
microphone/video-camera/etc, but I suspect that speech to text is enough of
a special case (in terms of how it's often implemented in hardware and in
terms of security) that it won't be possible to fold into a more general
microphone/video-camera/etc API without losing ease of use, which is pretty
central to the use cases listed in Bjorn's doc.

J

On Wed, May 19, 2010 at 9:30 AM, Anne van Kesteren ann...@opera.com wrote:

 On Wed, 19 May 2010 10:22:54 +0200, Satish Sampath sat...@google.com
 wrote:

 I don't really see how the problem is the same as with synchronous
 XMLHttpRequest. When you do a synchronous request nothing happens to the
 event loop so an alert() dialog could never happen. I think you want
 recording to continue though. Having a simple dialog stop video
 conferencing
 for instance would be annoying. It's only script execution that needs to
 be paused. I'm also not sure if I'd really want recording to stop while
 looking at a page in a different tab. Again, if I'm in a conference call I'm
 almost always doing tasks on the side. E.g. looking up past discussions,
 scrolling through a document we're discussing, etc.


 Can you clarify how the speech input element (as described in the current
 API sketch) is related to video conferencing or a conference call, since
 it doesn't really stream audio to any place other than potentially a speech
 recognition server and feeds the result back to the element?


 Well, as indicated in the other thread I'm not sure whether this is the
 best way to do it. Usually we start with a lower-level API (i.e. microphone
 input) and build up from there. But maybe I'm wrong and speech input is a
 case that needs to be considered separately. It would still not be like
 synchronous XMLHttpRequest though.



 --
 Anne van Kesteren
 http://annevankesteren.nl/



Re: [whatwg] Speech input element

2010-05-19 Thread David Singer
I am a little concerned that we are increasingly breaking down a metaphor, a 
'virtual interface' without realizing what that abstraction buys us.  At the 
moment, we have the concept of a hypothetical pointer and hypothetical 
keyboard, (with some abstract states, such as focus) that you can actually 
drive using a whole bunch of physical modalities.  If we develop UIs that are 
specific to people actually speaking, we have 'torn the veil' of that abstract 
interface.  What happens to people who cannot speak, for example? Or who cannot 
say the language needed well enough to be recognized?


David Singer
Multimedia and Software Standards, Apple Inc.



Re: [whatwg] Speech input element

2010-05-19 Thread timeless
On Thu, May 20, 2010 at 12:38 AM, David Singer sin...@apple.com wrote:
 I am a little concerned that we are increasingly breaking down a metaphor,
 a 'virtual interface' without realizing what that abstraction buys us.

I'm more than a little concerned about this and hope that we tread
much more carefully than it seems some parties are willing to do. I'm
glad I'm not alone.

 At the moment, we have the concept of a hypothetical pointer and hypothetical
 keyboard, (with some abstract states, such as focus) that you can actually 
 drive
 using a whole bunch of physical modalities.

 If we develop UIs that are specific to people actually speaking, we have
 'torn the veil' of that abstract interface.  What happens to people who cannot
 speak, for example? Or who cannot say the language needed well enough
 to be recognized?


Re: [whatwg] Speech input element

2010-05-18 Thread Anne van Kesteren
On Mon, 17 May 2010 15:05:22 +0200, Bjorn Bringert bring...@google.com  
wrote:

Back in December there was a discussion about web APIs for speech
recognition and synthesis that saw a decent amount of interest
(http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281).
Based on that discussion, we would like to propose a simple API for
speech recognition, using a new input type=speech element. An
informal spec of the new API, along with some sample apps and use
cases can be found at:
http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en.

It would be very helpful if you could take a look and share your
comments. Our next steps will be to implement the current design, get
some feedback from web developers, continue to tweak, and seek
standardization as soon as it looks mature enough and/or other vendors
become interested in implementing it.


I wonder how it relates to the device proposal already in the draft. In  
theory that supports microphone input too.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Speech input element

2010-05-18 Thread Bjorn Bringert
On Mon, May 17, 2010 at 9:23 PM, Olli Pettay olli.pet...@helsinki.fi wrote:
 On 5/17/10 6:55 PM, Bjorn Bringert wrote:

 (Looks like half of the first question is missing, so I'm guessing
 here) If you are asking about when the web app loses focus (e.g. the
 user switches to a different tab or away from the browser), I think
 the recognition should be cancelled. I've added this to the spec.


 Oh, where did the rest of the question go.

 I was going to ask about alert()s.
 What happens if alert() pops up while recognition is on?
 Which events should fire and when?

Hmm, good question. I think that either the recognition should be
cancelled, like when the web app loses focus, or it should continue
just as if there was no alert. Are there any browser implementation
reasons to do one or the other?


 The grammar specifies the set of utterances that the speech recognizer
 should match against. The grammar may be annotated with SISR, which
 will be used to populate the 'interpretation' field in ListenResult.

 I know what grammars are :)

Yeah, sorry about my silly reply there, I just wasn't sure exactly
what you were asking.


 What I meant is that it is not very well specified that the result is actually
 put into .value etc.

Yes, good point. The alternatives would be to use either the
'utterance' or the 'interpretation' value from the most likely
recognition result. If the grammar does not contain semantics, those
are identical, so it doesn't matter in that case. If the developer has
added semantics to the grammar, the interpretation is probably more
interesting than the utterance. So my conclusion is that it would make
most sense to store the interpretation in @value. I've updated the
spec with better definitions of @value and @results.


 And I'm still not quite sure what builtin:search actually
 is. What kind of grammar would that be? How is that different from
 builtin:dictation?

To be useful, those should probably be large statistical language
models (e.g. n-gram models) trained on different corpora. So
builtin:dictation might be trained on a corpus containing e-mails,
SMS messages and news text, and builtin:search might be trained on
query strings from a search engine. I've updated the spec to make
builtin:search optional, mapping to builtin:dictation if not
implemented. The exact language matched by these models would be
implementation dependent, and implementations may choose to be clever
about them. For example by:

- Dynamic tweaking for different web apps based on the user's previous
inputs and the text contained in the web app.

- Adding the names of all contacts from the user's address book to the
dictation model.

- Weighting place names based on geographic proximity (in an
implementation that has access to the user's location).


-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-18 Thread Bjorn Bringert
On Tue, May 18, 2010 at 8:02 AM, Anne van Kesteren ann...@opera.com wrote:
 On Mon, 17 May 2010 15:05:22 +0200, Bjorn Bringert bring...@google.com
 wrote:

 Back in December there was a discussion about web APIs for speech
 recognition and synthesis that saw a decent amount of interest

 (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281).
 Based on that discussion, we would like to propose a simple API for
 speech recognition, using a new input type=speech element. An
 informal spec of the new API, along with some sample apps and use
 cases can be found at:

 http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en.

 It would be very helpful if you could take a look and share your
 comments. Our next steps will be to implement the current design, get
 some feedback from web developers, continue to tweak, and seek
 standardization as soon as it looks mature enough and/or other vendors
 become interested in implementing it.

 I wonder how it relates to the device proposal already in the draft. In
 theory that supports microphone input too.

It would be possible to implement speech recognition on top of a
microphone input API. The most obvious approach would be to use
device to get an audio stream, and send that audio stream to a
server (e.g. using WebSockets). The server runs a speech recognizer
and returns the results.
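
A rough sketch of that alternative, just to make the moving parts concrete.
The WebSocket calls are standard; captureAudioChunks() stands in for whatever
audio-capture API a device element might expose, and the server URL and reply
format are made up for illustration:

--- sketch ---
<script type="text/javascript">
// Hypothetical client for server-side recognition over WebSockets.
var ws = new WebSocket("wss://recognizer.example.com/listen");

ws.onopen = function () {
  // captureAudioChunks is a stand-in for an audio-capture API; it is
  // assumed to deliver encoded audio chunks to the callback.
  captureAudioChunks(function (chunk) {
    ws.send(chunk);  // stream the audio to the recognition server
  });
};

ws.onmessage = function (event) {
  // Assume the server replies with the top recognition hypothesis as text.
  document.getElementById("q").value = event.data;
};
</script>
--- end of sketch ---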

Advantages of the speech input element:

- Web app developers do not need to build and maintain a speech
recognition service.

- Implementations can choose to use client-side speech recognition.
This could give reduced network traffic and latency (but probably also
reduced recognition accuracy and language support). Implementations
could also use server-side recognition by default, switching to local
recognition in offline or low bandwidth situations.

- Using a general audio capture API would require APIs for things like
audio encoding and audio streaming. Judging from the past results of
specifying media features, this may be non-trivial. The speech input
element turns all audio processing concerns into implementation
details.

- Implementations can have special UI treatment for speech input,
which may be different from that for general audio capture.


Advantages of using a microphone API:

- Web app developers get complete control over the quality and
features of the speech recognizer. This is a moot point for most
developers though, since they do not have the resources to run their
own speech recognition service.

- Fewer features to implement in browsers (assuming that a microphone
API would be added anyway).

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-18 Thread Olli Pettay

On 5/18/10 11:27 AM, Bjorn Bringert wrote:

On Mon, May 17, 2010 at 9:23 PM, Olli Pettayolli.pet...@helsinki.fi  wrote:

On 5/17/10 6:55 PM, Bjorn Bringert wrote:


(Looks like half of the first question is missing, so I'm guessing
here) If you are asking about when the web app loses focus (e.g. the
user switches to a different tab or away from the browser), I think
the recognition should be cancelled. I've added this to the spec.



Oh, where did the rest of the question go.

I was going to ask about alert()s.
What happens if alert() pops up while recognition is on?
Which events should fire and when?


Hmm, good question. I think that either the recognition should be
cancelled, like when the web app loses focus, or it should continue
just as if there was no alert. Are there any browser implementation
reasons to do one or the other?



Well, the problem with alert is that the assumption (which may or may 
not always hold) is that when alert() is opened, web page shouldn't run
any scripts. So should input type=speech fire some events when the
recognition is canceled (if alert cancels recognition), and if yes,
when? Or if recognition is not canceled, and something is recognized
(so input event should be dispatched), when should the event actually 
fire? The problem is pretty much the same with synchronous XMLHttpRequest.



-Olli


Re: [whatwg] Speech input element

2010-05-18 Thread Satish Sampath

 Well, the problem with alert is that the assumption (which may or may not
 always hold) is that when alert() is opened, web page shouldn't run
 any scripts. So should input type=speech fire some events when the
 recognition is canceled (if alert cancels recognition), and if yes,
 when? Or if recognition is not canceled, and something is recognized
 (so input event should be dispatched), when should the event actually
 fire? The problem is pretty much the same with synchronous XMLHttpRequest.


In my opinion, once the speech input element has started recording, any event
which takes the user's focus away from actually speaking should ideally stop
the speech recognition. This would include switching to a new window, a new
tab or modal/alert dialogs, submitting a form or navigating to a new page in
the same tab/window.

--
Cheers
Satish


Re: [whatwg] Speech input element

2010-05-18 Thread Kazuyuki Ashimura

Hi Bjorn,

Thank you for bringing this topic (again :) to the WHATWG list.
I'd like to bring this to the W3C Voice Browser Working Group (and
maybe the Multimodal Interaction Working Group as well) and ask
the group participants for opinion.

As you might know, the group recently created a task force named
"Voice on the Web" and is working hard to promote voice technology in
various possible Web applications.

Regards,

Kazuyuki


Bjorn Bringert wrote:

Back in December there was a discussion about web APIs for speech
recognition and synthesis that saw a decent amount of interest
(http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281).
Based on that discussion, we would like to propose a simple API for
speech recognition, using a new input type=speech element. An
informal spec of the new API, along with some sample apps and use
cases can be found at:
http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en.

It would be very helpful if you could take a look and share your
comments. Our next steps will be to implement the current design, get
some feedback from web developers, continue to tweak, and seek
standardization as soon as it looks mature enough and/or other vendors
become interested in implementing it.



--
Kazuyuki Ashimura / W3C Multimodal & Voice Activity Lead
mailto: ashim...@w3.org
voice: +81.466.49.1170 / fax: +81.466.49.1171


Re: [whatwg] Speech input element

2010-05-18 Thread Kazuyuki Ashimura

Hi Bjorn and James,

Just FYI, W3C is organizing a workshop on Conversational Applications.
The main goal of the workshop is collecting use cases and requirements
for new models of human language to support mobile conversational
systems.  The workshop will be held on June 18-19 in Somerset, NJ, US.

The detailed call for participation is available at:
 http://www.w3.org/2010/02/convapps/cfp.html

I think there may be some discussion during the workshop about a
possible multimodal e-learning system as a use case.  Is either of you
by chance interested in the workshop?

Regards,

Kazuyuki


Bjorn Bringert wrote:

On Mon, May 17, 2010 at 10:55 PM, James Salsman jsals...@gmail.com wrote:

On Mon, May 17, 2010 at 8:55 AM, Bjorn Bringert bring...@google.com wrote:

- What exactly are grammars builtin:dictation and builtin:search?

They are intended to be implementation-dependent large language
models, for dictation (e.g. e-mail writing) and search queries
respectively. I've tried to clarify them a bit in the spec now. There
should perhaps be more of these (e.g. builtin:address), maybe with
some being optional, mapping to builtin:dictation if not available.

Bjorn, are you interested in including speech recognition support for
pronunciation assessment such as is done by http://englishcentral.com/
, http://www.scilearn.com/products/reading-assistant/ ,
http://www.eyespeakenglish.com/ , and http://wizworldonline.com/ ,
http://www.8dworld.com/en/home.html ?

Those would require different sorts of language models and grammars
such as those described in
http://www.springerlink.com/content/l0385t6v425j65h7/

Please let me know your thoughts.


I don't have SpringerLink access, so I couldn't read that article. As
far as I could tell from the abstract, they use phoneme-level speech
recognition and then calculate the edit distance to the correct
phoneme sequences. Do you have a concrete proposal for how this could
be supported? Would support for PLS
(http://www.w3.org/TR/pronunciation-lexicon/) links in SRGS be enough
(the SRGS spec already includes that)?



--
Kazuyuki Ashimura / W3C Multimodal & Voice Activity Lead
mailto: ashim...@w3.org
voice: +81.466.49.1170 / fax: +81.466.49.1171



Re: [whatwg] Speech input element

2010-05-18 Thread Doug Schepers

Hi, Bjorn-

Bjorn Bringert wrote (on 5/17/10 9:05 AM):

Back in December there was a discussion about web APIs for speech
recognition and synthesis that saw a decent amount of interest
(http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281).
Based on that discussion, we would like to propose a simple API for
speech recognition, using a new input type=speech element. An
informal spec of the new API, along with some sample apps and use
cases can be found at:
http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en.

It would be very helpful if you could take a look and share your
comments. Our next steps will be to implement the current design, get
some feedback from web developers, continue to tweak, and seek
standardization as soon as it looks mature enough and/or other vendors
become interested in implementing it.


This is important work, thanks for taking it on and bringing it to a 
wider discussion forum.  Here's a couple of other venues you might also 
consider discussing it, above and beyond discussion on the WHATWG list:


* W3C just launched a new Audio Incubator Group (Audio XG), as a forum 
to discuss various aspects of audio on the Web.  The Audio XG is not 
intended to produce Recommendation-track specifications like this 
(though they will likely prototype and write a draft spec for a 
read-write audio API), but it could serve a role in helping work out use 
cases and requirements, reviewing specs, and so forth.  I'm not totally 
sure that this is relevant to your interests, but I thought I would 
bring it up.


* The Voice Browser Working Group is very interested in bringing their 
work and experience into the graphical browser world, so you should work 
with them or get their input.  As I understand it, some of them plan to 
join the Audio XG, too (specifically to talk about speech synthesis in 
the larger context), so that might be one forum to have some 
conversations.  VoiceXML is rather different than X/HTML or the browser 
DOM, and the participants in the VBWG don't necessarily have the right 
experience in graphical browser approaches, so I think there's an 
opportunity for good conversation and cross-pollination here.


[1] http://www.w3.org/2005/Incubator/audio/
[2] http://www.w3.org/Voice/

Regards-
-Doug


Re: [whatwg] Speech input element

2010-05-17 Thread Olli Pettay

On 5/17/10 4:05 PM, Bjorn Bringert wrote:

Back in December there was a discussion about web APIs for speech
recognition and synthesis that saw a decent amount of interest
(http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281).
Based on that discussion, we would like to propose a simple API for
speech recognition, using a new input type=speech element. An
informal spec of the new API, along with some sample apps and use
cases can be found at:
http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en.

It would be very helpful if you could take a look and share your
comments. Our next steps will be to implement the current design, get
some feedback from web developers, continue to tweak, and seek
standardization as soon as it looks mature enough and/or other vendors
become interested in implementing it.



After a quick read I, in general, like the proposal.

Few comments though.

- What should happen if for example
  What happens to the events which are fired during that time?
  Or should recognition stop?

- What exactly are grammars builtin:dictation and builtin:search?
  Especially the latter one is not at all clear to me

- When does recognitionState change? Before which events?

- It is not quite clear how SRGS works with input type=speech

- I believe there is no need for
  DOMImplementation.hasFeature("SpeechInput", "1.0")

And I think we really need to define something for TTS.
Not every web developer has servers for text-to-audio.


-Olli


Re: [whatwg] Speech input element

2010-05-17 Thread Bjorn Bringert
On Mon, May 17, 2010 at 3:00 PM, Olli Pettay olli.pet...@helsinki.fi wrote:
 On 5/17/10 4:05 PM, Bjorn Bringert wrote:

 Back in December there was a discussion about web APIs for speech
 recognition and synthesis that saw a decent amount of interest

 (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281).
 Based on that discussion, we would like to propose a simple API for
 speech recognition, using a new input type=speech element. An
 informal spec of the new API, along with some sample apps and use
 cases can be found at:

 http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhxhl=en.

 It would be very helpful if you could take a look and share your
 comments. Our next steps will be to implement the current design, get
 some feedback from web developers, continue to tweak, and seek
 standardization as soon as it looks mature enough and/or other vendors
 become interested in implementing it.


 After a quick read I, in general, like the proposal.

It's pretty underspecified still, as you can see. Thanks for pointing
out some missing pieces.


 Few comments though.

 - What should happen if for example
  What happens to the events which are fired during that time?
  Or should recognition stop?

(Looks like half of the first question is missing, so I'm guessing
here) If you are asking about when the web app loses focus (e.g. the
user switches to a different tab or away from the browser), I think
the recognition should be cancelled. I've added this to the spec.


 - What exactly are grammars builtin:dictation and builtin:search?
  Especially the latter one is not at all clear to me

They are intended to be implementation-dependent large language
models, for dictation (e.g. e-mail writing) and search queries
respectively. I've tried to clarify them a bit in the spec now. There
should perhaps be more of these (e.g. builtin:address), maybe with
some being optional, mapping to builtin:dictation if not available.


 - When does recognitionState change? Before which events?

Thanks, that was very underspecified. I've added a diagram to clarify it.


 - It is not quite clear how SRGS works with input type=speech

The grammar specifies the set of utterances that the speech recognizer
should match against. The grammar may be annotated with SISR, which
will be used to populate the 'interpretation' field in ListenResult.

Since grammars may be protected by cookies etc that are only available
in the browsing session, I think the user agent will have to fetch the
grammar and then pass it to the speech recognizer, rather than the
recognizer accessing it directly.

I'm not sure if any of that answers your question though.


 - I believe there is no need for
  DOMImplementation.hasFeature("SpeechInput", "1.0")

The intention was that apps could use this to conditionally enable
features that require speech input support. Is there some other
mechanism that should be used instead?
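
For example, a minimal sketch of such a conditional check; the mic-button id
and the fallback behaviour are made up for illustration:

--- sketch ---
<script type="text/javascript">
// Reveal a microphone button only if the UA reports speech input support.
if (document.implementation.hasFeature("SpeechInput", "1.0")) {
  document.getElementById("mic-button").style.display = "inline";
}
</script>
--- end of sketch ---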


 And I think we really need to define something for TTS.
 Not every web developer has servers for text-to-audio.

Yes, I agree. We intend to work on that next, but didn't include it in
this proposal since they are pretty separate features from the browser
point of view.


-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-17 Thread Olli Pettay

On 5/17/10 6:55 PM, Bjorn Bringert wrote:


(Looks like half of the first question is missing, so I'm guessing
here) If you are asking about when the web app loses focus (e.g. the
user switches to a different tab or away from the browser), I think
the recognition should be cancelled. I've added this to the spec.



Oh, where did the rest of the question go.

I was going to ask about alert()s.
What happens if alert() pops up while recognition is on?
Which events should fire and when?



The grammar specifies the set of utterances that the speech recognizer
should match against. The grammar may be annotated with SISR, which
will be used to populate the 'interpretation' field in ListenResult.

I know what grammars are :)
What I meant is that it is not very well specified that the result is
actually put into .value etc.



And I'm still not quite sure what builtin:search actually
is. What kind of grammar would that be? How is that different from
builtin:dictation?



-Olli


Re: [whatwg] Speech input element

2010-05-17 Thread James Salsman
On Mon, May 17, 2010 at 8:55 AM, Bjorn Bringert bring...@google.com wrote:

 - What exactly are grammars builtin:dictation and builtin:search?

 They are intended to be implementation-dependent large language
 models, for dictation (e.g. e-mail writing) and search queries
 respectively. I've tried to clarify them a bit in the spec now. There
 should perhaps be more of these (e.g. builtin:address), maybe with
 some being optional, mapping to builtin:dictation if not available.

Bjorn, are you interested in including speech recognition support for
pronunciation assessment such as is done by http://englishcentral.com/
, http://www.scilearn.com/products/reading-assistant/ ,
http://www.eyespeakenglish.com/ , and http://wizworldonline.com/ ,
http://www.8dworld.com/en/home.html ?

Those would require different sorts of language models and grammars
such as those described in
http://www.springerlink.com/content/l0385t6v425j65h7/

Please let me know your thoughts.

Best regards,
James Salsman - http://talknicer.com/