Re: [whatwg] Speech input element

2010-06-14 Thread Bjorn Bringert
Based on the feedback in this thread we've worked out a new speech
input proposal that adds a @speech attribute to most input elements,
instead of a new <input type="speech">. Please see
http://docs.google.com/View?id=dcfg79pz_5dhnp23f5 for the new
proposal.
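
For illustration, a minimal sketch of what a speech-enabled field might
look like under the new proposal. The @speech attribute name comes from
the message above; the grammar attribute, and the assumption that results
are delivered through the normal 'input' event and .value, are carried
over from the earlier <input type="speech"> discussion below and are not
authoritative:

  <input type="text" id="q" name="q" speech grammar="builtin:search">
  <script>
    var q = document.getElementById('q');
    // Assumption: recognized text is placed in .value and the ordinary
    // 'input' event fires, just as for typed input.
    q.addEventListener('input', function () {
      console.log('Recognized: ' + q.value);
    }, false);
  </script>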

/Bjorn Bringert & Satish Sampath

-- 
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-20 Thread Bjorn Bringert
On Thu, May 20, 2010 at 1:32 PM, Anne van Kesteren  wrote:
> On Thu, 20 May 2010 14:29:16 +0200, Bjorn Bringert 
> wrote:
>>
>> It should be possible to drive <input type="speech"> with keyboard
>> input, if the user agent chooses to implement that. Nothing in the API
>> should require the user to actually speak. I think this is a strong
>> argument for why <input type="speech"> should not be replaced by a
>> microphone API and a separate speech recognizer, since the latter
>> would be very hard to make accessible. (I still think that there
>> should be a microphone API for applications like audio chat, but
>> that's a separate discussion).
>
> So why not implement speech support on top of the existing input types?

Speech-driven keyboards certainly get you some of the benefits of
<input type="speech">, but they give the application developer less
control and less information than a speech-specific API. Some
advantages of a dedicated speech input type:

- Application-defined grammars. This is important for getting high
recognition accuracy in limited domains (see the sketch after this list).

- Allows continuous speech recognition where the app gets events on
speech endpoints.

- Multiple recognition hypotheses. This lets applications implement
intelligent input disambiguation.

- Doesn't require the input element to have keyboard focus while speaking.

- Doesn't require a visible text input field.
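
A rough sketch of how the first and third of these might look in practice.
The grammar and results names and the utterance/interpretation fields
follow the informal spec as discussed later in this thread; the element id,
the grammar file and the 'input' event wiring are assumptions for
illustration only:

  <input type="speech" id="order" grammar="pizza-order.grxml">
  <script>
    var el = document.getElementById('order');
    el.addEventListener('input', function () {
      // Hypothetical n-best list: each entry carries the raw utterance
      // and, if the grammar has SISR annotations, an interpretation.
      var results = el.results || [];
      for (var i = 0; i < results.length; i++) {
        console.log(results[i].utterance, results[i].interpretation);
      }
    }, false);
  </script>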

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-20 Thread Bjorn Bringert
On Wed, May 19, 2010 at 10:38 PM, David Singer  wrote:
> I am a little concerned that we are increasingly breaking down a metaphor, a 
> 'virtual interface' without realizing what that abstraction buys us.  At the 
> moment, we have the concept of a hypothetical pointer and hypothetical 
> keyboard, (with some abstract states, such as focus) that you can actually 
> drive using a whole bunch of physical modalities.  If we develop UIs that are 
> specific to people actually speaking, we have 'torn the veil' of that 
> abstract interface.  What happens to people who cannot speak, for example? Or 
> who cannot say the language needed well enough to be recognized?

It should be possible to drive <input type="speech"> with keyboard
input, if the user agent chooses to implement that. Nothing in the API
should require the user to actually speak. I think this is a strong
argument for why <input type="speech"> should not be replaced by a
microphone API and a separate speech recognizer, since the latter
would be very hard to make accessible. (I still think that there
should be a microphone API for applications like audio chat, but
that's a separate discussion).

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-18 Thread Bjorn Bringert
On Tue, May 18, 2010 at 10:27 AM, Satish Sampath  wrote:
>> Well, the problem with alert is that the assumption (which may or may not
>> always hold) is that when alert() is opened, the web page shouldn't run
>> any scripts. So should <input type="speech"> fire some events when the
>> recognition is canceled (if alert cancels recognition), and if yes,
>> when? Or if recognition is not canceled, and something is recognized
>> (so "input" event should be dispatched), when should the event actually
>> fire? The problem is pretty much the same with synchronous XMLHttpRequest.
>
> In my opinion, once the speech input element has started recording any event
> which takes the user's focus away from actually speaking should ideally stop
> the speech recognition. This would include switching to a new window, a new
> tab or modal/alert dialogs, submitting a form or navigating to a new page in
> the same tab/window.

Yes, I agree with that. The tricky issue, as Olli points out, is
whether and when the 'error' event should fire when recognition is
aborted because the user moves away or gets an alert. What does
XMLHttpRequest do?

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-18 Thread Bjorn Bringert
On Tue, May 18, 2010 at 8:02 AM, Anne van Kesteren  wrote:
> On Mon, 17 May 2010 15:05:22 +0200, Bjorn Bringert 
> wrote:
>>
>> Back in December there was a discussion about web APIs for speech
>> recognition and synthesis that saw a decent amount of interest
>>
>> (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281).
>> Based on that discussion, we would like to propose a simple API for
>> speech recognition, using a new <input type="speech"> element. An
>> informal spec of the new API, along with some sample apps and use
>> cases can be found at:
>>
>> http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhx&hl=en.
>>
>> It would be very helpful if you could take a look and share your
>> comments. Our next steps will be to implement the current design, get
>> some feedback from web developers, continue to tweak, and seek
>> standardization as soon as it looks mature enough and/or other vendors
>> become interested in implementing it.
>
> I wonder how it relates to the <device> proposal already in the draft. In
> theory that supports microphone input too.

It would be possible to implement speech recognition on top of a
microphone input API. The most obvious approach would be to use
<device> to get an audio stream, and send that audio stream to a
server (e.g. using WebSockets). The server runs a speech recognizer
and returns the results.
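
To make the comparison concrete, here is a sketch of that alternative. The
audio-capture surface is left entirely hypothetical (the <device> proposal
is not specified at this level of detail), and the recognition server and
its JSON reply format are assumptions:

  // Hypothetical: 'stream' is an audio stream obtained from a <device>
  // element; onaudiochunk and the wss://example.org/asr endpoint are
  // placeholders, not real APIs.
  function recognizeViaServer(stream, onResult) {
    var socket = new WebSocket('wss://example.org/asr');
    socket.onopen = function () {
      stream.onaudiochunk = function (chunk) { socket.send(chunk); };
    };
    socket.onmessage = function (event) {
      // The server is assumed to return an n-best list as JSON.
      onResult(JSON.parse(event.data));
    };
  }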

Advantages of the speech input element:

- Web app developers do not need to build and maintain a speech
recognition service.

- Implementations can choose to use client-side speech recognition.
This could give reduced network traffic and latency (but probably also
reduced recognition accuracy and language support). Implementations
could also use server-side recognition by default, switching to local
recognition in offline or low bandwidth situations.

- Using a general audio capture API would require APIs for things like
audio encoding and audio streaming. Judging from the past results of
specifying media features, this may be non-trivial. The speech input
element turns all audio processing concerns into implementation
details.

- Implementations can have special UI treatment for speech input,
which may be different from that for general audio capture.


Advantages of using a microphone API:

- Web app developers get complete control over the quality and
features of the speech recognizer. This is a moot point for most
developers though, since they do not have the resources to run their
own speech recognition service.

- Fewer features to implement in browsers (assuming that a microphone
API would be added anyway).

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-18 Thread Bjorn Bringert
On Mon, May 17, 2010 at 10:55 PM, James Salsman  wrote:
> On Mon, May 17, 2010 at 8:55 AM, Bjorn Bringert  wrote:
>>
>>> - What exactly are grammars builtin:dictation and builtin:search?
>>
>> They are intended to be implementation-dependent large language
>> models, for dictation (e.g. e-mail writing) and search queries
>> respectively. I've tried to clarify them a bit in the spec now. There
>> should perhaps be more of these (e.g. builtin:address), maybe with
>> some of them optional, mapping to builtin:dictation if not available.
>
> Bjorn, are you interested in including speech recognition support for
> pronunciation assessment such as is done by http://englishcentral.com/
> , http://www.scilearn.com/products/reading-assistant/ ,
> http://www.eyespeakenglish.com/ , and http://wizworldonline.com/ ,
> http://www.8dworld.com/en/home.html ?
>
> Those would require different sorts of language models and grammars
> such as those described in
> http://www.springerlink.com/content/l0385t6v425j65h7/
>
> Please let me know your thoughts.

I don't have SpringerLink access, so I couldn't read that article. As
far as I could tell from the abstract, they use phoneme-level speech
recognition and then calculate the edit distance to the "correct"
phoneme sequences. Do you have a concrete proposal for how this could
be supported? Would support for PLS
(http://www.w3.org/TR/pronunciation-lexicon/) links in SRGS be enough
(the SRGS spec already includes that)?

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-18 Thread Bjorn Bringert
On Mon, May 17, 2010 at 9:23 PM, Olli Pettay  wrote:
> On 5/17/10 6:55 PM, Bjorn Bringert wrote:
>
>> (Looks like half of the first question is missing, so I'm guessing
>> here) If you are asking about when the web app loses focus (e.g. the
>> user switches to a different tab or away from the browser), I think
>> the recognition should be cancelled. I've added this to the spec.
>>
>
> Oh, where did the rest of the question go.
>
> I was going to ask about alert()s.
> What happens if alert() pops up while recognition is on?
> Which events should fire and when?

Hmm, good question. I think that either the recognition should be
cancelled, like when the web app loses focus, or it should continue
just as if there was no alert. Are there any browser implementation
reasons to do one or the other?


>> The grammar specifies the set of utterances that the speech recognizer
>> should match against. The grammar may be annotated with SISR, which
>> will be used to populate the 'interpretation' field in ListenResult.
>
> I know what grammars are :)

Yeah, sorry about my silly reply there, I just wasn't sure exactly
what you were asking.


> What I meant was that it is not very well specified that the result is actually
> put into .value etc.

Yes, good point. The alternatives would be to use either the
'utterance' or the 'interpretation' value from the most likely
recognition result. If the grammar does not contain semantics, those
are identical, so it doesn't matter in that case. If the developer has
added semantics to the grammar, the interpretation is probably more
interesting than the utterance. So my conclusion is that it would make
most sense to store the interpretation in @value. I've updated the
spec with better definitions of @value and @results.
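
A small sketch of the resulting relationship between @results and @value.
The results/utterance/interpretation names are from the informal spec; the
element id is made up, and the exact string form stored in .value is an
assumption:

  var el = document.getElementById('order'); // an <input type="speech">
  var best = el.results && el.results[0];
  if (best) {
    console.log(best.utterance);       // what the recognizer heard
    console.log(best.interpretation);  // SISR result, if the grammar has semantics
    console.log(el.value);             // expected to hold the interpretation of the top result
  }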


> And still, I'm still not quite sure what builtin:search actually
> is. What kind of grammar would that be? How is that different from
> builtin:dictation?

To be useful, those should probably be large statistical language
models (e.g. n-gram models) trained on different corpora. So
"builtin:dictation" might be trained on a corpus containing e-mails,
SMS messages and news text, and "builtin:search" might be trained on
query strings from a search engine. I've updated the spec to make
"builtin:search" optional, mapping to "builtin:dictation" if not
implemented. The exact language matched by these models would be
implementation dependent, and implementations may choose to be clever
about them. For example by:

- Dynamic tweaking for different web apps based on the user's previous
inputs and the text contained in the web app.

- Adding the names of all contacts from the user's address book to the
dictation model.

- Weighting place names based on geographic proximity (in an
implementation that has access to the user's location).


-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Speech input element

2010-05-17 Thread Bjorn Bringert
On Mon, May 17, 2010 at 3:00 PM, Olli Pettay  wrote:
> On 5/17/10 4:05 PM, Bjorn Bringert wrote:
>>
>> Back in December there was a discussion about web APIs for speech
>> recognition and synthesis that saw a decent amount of interest
>>
>> (http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281).
>> Based on that discussion, we would like to propose a simple API for
>> speech recognition, using a new <input type="speech"> element. An
>> informal spec of the new API, along with some sample apps and use
>> cases can be found at:
>>
>> http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhx&hl=en.
>>
>> It would be very helpful if you could take a look and share your
>> comments. Our next steps will be to implement the current design, get
>> some feedback from web developers, continue to tweak, and seek
>> standardization as soon as it looks mature enough and/or other vendors
>> become interested in implementing it.
>>
>
> After a quick read I, in general, like the proposal.

It's pretty underspecified still, as you can see. Thanks for pointing
out some missing pieces.


> Few comments though.
>
> - What should happen if for example
>  What happens to the events which are fired during that time?
>  Or should recognition stop?

(Looks like half of the first question is missing, so I'm guessing
here) If you are asking about when the web app loses focus (e.g. the
user switches to a different tab or away from the browser), I think
the recognition should be cancelled. I've added this to the spec.


> - What exactly are grammars builtin:dictation and builtin:search?
>  Especially the latter one is not at all clear to me

They are intended to be implementation-dependent large language
models, for dictation (e.g. e-mail writing) and search queries
respectively. I've tried to clarify them a bit in the spec now. There
should perhaps be more of these (e.g. builtin:address), maybe with
some of them optional, mapping to builtin:dictation if not available.


> - When does recognitionState change? Before which events?

Thanks, that was very underspecified. I've added a diagram to clarify it.


> - It is not quite clear how SRGS works with <input type="speech">

The grammar specifies the set of utterances that the speech recognizer
should match against. The grammar may be annotated with SISR, which
will be used to populate the 'interpretation' field in ListenResult.

Since grammars may be protected by cookies etc that are only available
in the browsing session, I think the user agent will have to fetch the
grammar and then pass it to the speech recognizer, rather than the
recognizer accessing it directly.

I'm not sure if any of that answers your question though.


> - I believe there is no need for
>  DOMImplementation.hasFeature("SpeechInput", "1.0")

The intention was that apps could use this to conditionally enable
features that require speech input support. Is there some other
mechanism that should be used instead?
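
For example, something along these lines (hasFeature itself is a standard
DOM method; the "SpeechInput" feature string is specific to this proposal,
and the two app functions are hypothetical):

  if (document.implementation.hasFeature('SpeechInput', '1.0')) {
    enableSpeechShortcuts();     // hypothetical app function
  } else {
    fallBackToKeyboardOnlyUI();  // hypothetical app function
  }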


> And I think we really need to define something for TTS.
> Not every web developer has servers for text -> speech.

Yes, I agree. We intend to work on that next, but didn't include it in
this proposal since they are pretty separate features from the browser
point of view.


-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


[whatwg] Speech input element

2010-05-17 Thread Bjorn Bringert
Back in December there was a discussion about web APIs for speech
recognition and synthesis that saw a decent amount of interest
(http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-December/thread.html#24281).
Based on that discussion, we would like to propose a simple API for
speech recognition, using a new <input type="speech"> element. An
informal spec of the new API, along with some sample apps and use
cases can be found at:
http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhx&hl=en.

It would be very helpful if you could take a look and share your
comments. Our next steps will be to implement the current design, get
some feedback from web developers, continue to tweak, and seek
standardization as soon as it looks mature enough and/or other vendors
become interested in implementing it.

-- 
Bjorn Bringert & Satish Sampath

Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-15 Thread Bjorn Bringert
>>> Amazing how shared ideas like these seem to arise independently at the same
>>> time.
>>>
>>> I have a use-case and an additional requirement, that the time indices be
>>> made available for when each word is spoken in the TTS-generated audio:
>>>
>>>> I've been working on a web app which reads text in a web page,
>>>> highlighting each word as it is read. For this to be possible, a
>>>> Text-To-Speech API is needed which is able to:
>>>> (1) generate the speech audio from some text, and
>>>> (2) include the time indices for when each of the words in the text is
>>>> spoken.
>>>
>>> I foresee that a TTS API should integrate closely with the HTML5 Audio
>>> API. For example, invoking a call to the API could return a "TTS" object
>>> which has an instance of Audio, whose interface could be used to navigate
>>> through the TTS output. For example:
>>>
>>> var tts = new TextToSpeech("Hello, World!");
>>> tts.audio.addEventListener("canplaythrough", function(e){
>>>     //tts.indices == [{startTime:0, endTime:500, text:"Hello"},
>>> {startTime:500, endTime:1000, text:"World"}]
>>> }, false);
>>> tts.read(); //invokes tts.audio.play
>>>
>>> What would be even cooler, is if the parameter passed to the TextToSpeech
>>> constructor could be an Element or TextNode, and the indices would then
>>> include a DOM Range in addition to the "text" property. A flag could also be
>>> set which would result in each of these DOM ranges to be selected when it is
>>> read. For example:
>>>
>>> var tts = new TextToSpeech(document.querySelector("article"));
>>> tts.selectRangesOnRead = true;
>>> tts.audio.addEventListener("canplaythrough", function(e){
>>>     /*
>>>     tts.indices == [
>>>     {startTime:0, endTime:500, text:"Hello", range:Range},
>>>     {startTime:500, endTime:1000, text:"World", range:Range}
>>>     ]
>>>     */
>>> }, false);
>>> tts.read();
>>>
>>> In addition to the events fired by the Audio API, more events could be
>>> fired when reading TTS, such as a "readrange" event whose event object would
>>> include the index (startTime, endTime, text, range) for the range currently
>>> being spoken. Such functionality would make the ability to "read along" with
>>> the text trivial.
>>>
>>> What do you think?
>>> Weston
>>>
>>> On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert 
>>> wrote:
>>>>
>>>> On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking  wrote:
>>>> > On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert 
>>>> > wrote:
>>>> >> I agree that being able to capture and upload audio to a server would
>>>> >> be useful for a lot of applications, and it could be used to do
>>>> >> speech
>>>> >> recognition. However, for a web app developer who just wants to
>>>> >> develop an application that uses speech input and/or output, it
>>>> >> doesn't seem very convenient, since it requires server-side
>>>> >> infrastructure that is very costly to develop and run. A
>>>> >> speech-specific API in the browser gives browser implementors the
>>>> >> option to use on-device speech services provided by the OS, or
>>>> >> server-side speech synthesis/recognition.
>>>> >
>>>> > Again, it would help a lot if you could provide use cases and
>>>> > requirements. This helps both with designing an API, as well as
>>>> > evaluating if the use cases are common enough that a dedicated API is
>>>> > the best solution.
>>>> >
>>>> > / Jonas
>>>>
>>>> I'm mostly thinking about speech web apps for mobile devices. I think
>>>> that's where speech makes most sense as an input and output method,
>>>> because of the poor keyboards, small screens, and frequent hands/eyes
>>>> busy situations (e.g. while driving). Accessibility is the other big
>>>> reason for using speech.
>>>>
>>>> Some ideas for use cases:
>>>>
>>>> - Search by speaking a query
>>>> - Speech-to-speech translation
>>>> - Voice Dialing (could open a tel: URI to actually make the call)
>>>> - Dialog systems (e.g. the canonical pizza ordering system)
>>>> - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
>>>> Chrome extensions) for using speech with any web site, e.g., for
>>>> accessibility.
>>>>
>>>> Requirements:
>>>>
>>>> - Web app developer side:
>>>>   - Allows both speech recognition and synthesis.
>>>>   - Easy to use API. Makes simple things easy and advanced things
>>>> possible.
>>>>   - Doesn't require web app developer to develop / run his own speech
>>>> recognition / synthesis servers.
>>>>   - (Natural) language-neutral API.
>>>>   - Allows developer-defined application specific grammars / language
>>>> models.
>>>>   - Allows multilingual applications.
>>>>   - Allows easy localization of speech apps.
>>>>
>>>> - Implementor side:
>>>>   - Easy enough to implement that it can get wide adoption in browsers.
>>>>   - Allows implementor to use either client-side or server-side
>>>> recognition and synthesis.
>>>>
>>>> --
>>>> Bjorn Bringert
>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>> Palace Road, London, SW1W 9TQ
>>>> Registered in England Number: 3977902
>>>
>>
>
>



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-11 Thread Bjorn Bringert
Thanks for the discussion - cool to see more interest today also
(http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html)

I've hacked up a proof-of-concept JavaScript API for speech
recognition and synthesis. It adds a navigator.speech object with
these functions:

void listen(ListenCallback callback, ListenOptions options);
void speak(DOMString text, SpeakCallback callback, SpeakOptions options);
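
A usage sketch for this API. The listen()/speak() signatures are as above;
the shape of the callback argument and of the options objects is an
assumption based on the demos:

  navigator.speech.listen(function (result) {
    // Assumption: the callback receives the recognized text in some form.
    var text = result.utterance || String(result);
    navigator.speech.speak('You said: ' + text, function () {
      console.log('done speaking');
    }, {});
  }, { grammar: 'builtin:search' }); // options fields are assumptions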

The implementation uses an NPAPI plugin for the Android browser that
wraps the existing Android speech APIs. The code is available at
http://code.google.com/p/speech-api-browser-plugin/

There are some simple demo apps in
http://code.google.com/p/speech-api-browser-plugin/source/browse/trunk/android-plugin/demos/
including:

- English to Spanish speech-to-speech translation
- Google search by speaking a query
- The obligatory pizza ordering system
- A phone number dialer

Comments appreciated!

/Bjorn

On Fri, Dec 4, 2009 at 2:51 PM, Olli Pettay  wrote:
> Indeed the API should be something significantly simpler than X+V.
> Microsoft has (had?) support for SALT. That API is pretty simple and
> provides speech recognition and TTS.
> The API could probably be even simpler than SALT.
> IIRC, there was an extension for Firefox to support SALT (well, there was
> also an extension to support X+V).
>
> If the platform/OS provides ASR and TTS, adding a JS API for it should
> be pretty simple. X+V tries to handle some logic using VoiceXML FIA, but
> I think it would be more web-like to give pure JS API (similar to SALT).
> Integrating visual and voice input could be done in scripts. I'd assume
> there would be some script libraries to handle multimodal input integration
> - especially if there will be touch and gesture events too, etc. (Classic
> multimodal map applications will become possible on the web.)
>
> But all this is something which should possibly be designed in or with the W3C
> multimodal working group. I know their current architecture is way more
> complex, but X+V, SALT and even Multimodal-CSS have been discussed in that
> working group.
>
>
> -Olli
>
>
>
> On 12/3/09 2:50 AM, Dave Burke wrote:
>>
>> We're envisaging a simpler programmatic API that looks familiar to the
>> modern Web developer but one which avoids the legacy of dialog system
>> languages.
>>
>> Dave
>>
>> On Wed, Dec 2, 2009 at 7:25 PM, João Eiras <jo...@opera.com> wrote:
>>
>>    On Wed, 02 Dec 2009 12:32:07 +0100, Bjorn Bringert <bring...@google.com> wrote:
>>
>>        We've been watching our colleagues build native apps that use
>> speech
>>        recognition and speech synthesis, and would like to have JavaScript
>>        APIs that let us do the same in web apps. We are thinking about
>>        creating a lightweight and implementation-independent API that lets
>>        web apps use speech services. Is anyone else interested in that?
>>
>>        Bjorn Bringert, David Singleton, Gummi Hafsteinsson
>>
>>
>>    This exists already, but only Opera supports it, although there are
>>    problems with the library we use for speech recognition.
>>
>>    http://www.w3.org/TR/xhtml+voice/
>>
>>  http://dev.opera.com/articles/view/add-voice-interactivity-to-your-site/
>>
>>    Would be nice to revive that specification and get vendor buy-in.
>>
>>
>>
>>    --
>>
>>    João Eiras
>>    Core Developer, Opera Software ASA, http://www.opera.com/
>>
>>
>
>



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-03 Thread Bjorn Bringert
On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking  wrote:
> On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert  wrote:
>> I agree that being able to capture and upload audio to a server would
>> be useful for a lot of applications, and it could be used to do speech
>> recognition. However, for a web app developer who just wants to
>> develop an application that uses speech input and/or output, it
>> doesn't seem very convenient, since it requires server-side
>> infrastructure that is very costly to develop and run. A
>> speech-specific API in the browser gives browser implementors the
>> option to use on-device speech services provided by the OS, or
>> server-side speech synthesis/recognition.
>
> Again, it would help a lot if you could provide use cases and
> requirements. This helps both with designing an API, as well as
> evaluating if the use cases are common enough that a dedicated API is
> the best solution.
>
> / Jonas

I'm mostly thinking about speech web apps for mobile devices. I think
that's where speech makes most sense as an input and output method,
because of the poor keyboards, small screens, and frequent hands/eyes
busy situations (e.g. while driving). Accessibility is the other big
reason for using speech.

Some ideas for use cases:

- Search by speaking a query
- Speech-to-speech translation
- Voice Dialing (could open a tel: URI to actually make the call)
- Dialog systems (e.g. the canonical pizza ordering system)
- Lightweight JavaScript browser extensions (e.g. Greasemonkey /
Chrome extensions) for using speech with any web site, e.g., for
accessibility.

Requirements:

- Web app developer side:
   - Allows both speech recognition and synthesis.
   - Easy to use API. Makes simple things easy and advanced things possible.
   - Doesn't require web app developer to develop / run his own speech
recognition / synthesis servers.
   - (Natural) language-neutral API.
   - Allows developer-defined application specific grammars / language models.
   - Allows multilingual applications.
   - Allows easy localization of speech apps.

- Implementor side:
   - Easy enough to implement that it can get wide adoption in browsers.
   - Allows implementor to use either client-side or server-side
recognition and synthesis.

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Bjorn Bringert
I agree that being able to capture and upload audio to a server would
be useful for a lot of applications, and it could be used to do speech
recognition. However, for a web app developer who just wants to
develop an application that uses speech input and/or output, it
doesn't seem very convenient, since it requires server-side
infrastructure that is very costly to develop and run. A
speech-specific API in the browser gives browser implementors the
option to use on-device speech services provided by the OS, or
server-side speech synthesis/recognition.

/Bjorn

On Wed, Dec 2, 2009 at 6:23 PM, Diogo Resende  wrote:
> I misunderstood too. It would be great to have the ability to access
> the microphone and record+upload or stream sound to the web server.
>
> --
> D.
>
>
> On Wed, 2009-12-02 at 10:04 -0800, Jonas Sicking wrote:
>> On Wed, Dec 2, 2009 at 9:17 AM, Bjorn Bringert  wrote:
>> > I think that it would be best to extend the browser with a JavaScript
>> > speech API intended for use by web apps. That is, only web apps that
>> > use the speech API would have speech support. But it should be
>> > possible to use such an API to write browser extensions (using
>> > Greasemonkey, Chrome extensions etc) that allow speech control of the
>> > browser and speech synthesis of web page contents. Doing it the other
>> > way around seems like it would reduce the flexibility for web app
>> > developers.
>>
>> Hmm.. I guess I misunderstood your original proposal.
>>
>> Do you want the browser to expose an API that converts speech to text?
>> Or do you want the browser to expose access to the microphone so that
>> you can do speech to text conversion in JavaScript?
>>
>> If the former, could you describe your use cases in more detail?
>>
>> / Jonas
>



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


Re: [whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Bjorn Bringert
I think that it would be best to extend the browser with a JavaScript
speech API intended for use by web apps. That is, only web apps that
use the speech API would have speech support. But it should be
possible to use such an API to write browser extensions (using
Greasemonkey, Chrome extensions etc) that allow speech control of the
browser and speech synthesis of web page contents. Doing it the other
way around seems like it would reduce the flexibility for web app
developers.

/Bjorn

On Wed, Dec 2, 2009 at 4:55 PM, Mike Hearn  wrote:
> Is speech support a feature of the web page, or the web browser?
>
> On Wed, Dec 2, 2009 at 12:32 PM, Bjorn Bringert  wrote:
>> We've been watching our colleagues build native apps that use speech
>> recognition and speech synthesis, and would like to have JavaScript
>> APIs that let us do the same in web apps. We are thinking about
>> creating a lightweight and implementation-independent API that lets
>> web apps use speech services. Is anyone else interested in that?
>>
>> Bjorn Bringert, David Singleton, Gummi Hafsteinsson
>>
>> --
>> Bjorn Bringert
>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>> Palace Road, London, SW1W 9TQ
>> Registered in England Number: 3977902
>>
>



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902


[whatwg] Web API for speech recognition and synthesis

2009-12-02 Thread Bjorn Bringert
We've been watching our colleagues build native apps that use speech
recognition and speech synthesis, and would like to have JavaScript
APIs that let us do the same in web apps. We are thinking about
creating a lightweight and implementation-independent API that lets
web apps use speech services. Is anyone else interested in that?

Bjorn Bringert, David Singleton, Gummi Hafsteinsson

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902