Hi all,

OK, I said I would follow up on the voice chat thing and here it is in
outline, but more detailed than my original message back in March '98 on
this list.

Voice has 2 great advantages over text: immediacy and expressiveness.

It also has some great disadvantages: it is a bandwidth hog, and when stored
it takes up too much room and is almost impossible to search for content.

Current speech-to-text systems go too far to be of great use, I believe.
They work really hard on extracting the words from a voice but discard all
the helpful tonal and accenting info. And text-to-speech has little more
than novelty value to any but the blind. Converting text to speech uses
rules that give expressive rise and fall to the voice, but only according
to an algorithm -- not as the original speaker might have done it. Irony,
imitation, warning, rhetorical questions all lose their deeper meanings.
Also the rules for pronunciation can easily be foiled by words such as
"live" ("live bait" vs "live your life").

In my voice chat idea the program extracts the phonemes from the voice and
sends them as a specially coded text stream. This wouldn't require any of
the really difficult context-sensitive work of picking out the words from
the stream -- it would just send the whole mess down the line as phonemes
with pitch and volume info as it received it. This text stream would go
thru the server just as text chat currently does. There are a few schemes
already for representing phonemes using ordinary ASCII characters (for
example the Amiga computer's text-to-speech system uses an expanded version
of ARPA's Arpabet).
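
As a rough illustration of what such a coded text stream might look like,
here is a minimal Python sketch. The token layout (PHONEME:pitch:volume,
separated by spaces), the 0-9 volume scale and the names Frame, encode and
decode are all made up for this example; they are not an existing standard
and not the Amiga format. Only the phoneme symbols are Arpabet-style.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    phoneme: str   # Arpabet-style symbol, e.g. "HH"
    pitch_hz: int  # fundamental frequency; 0 is used here for unvoiced sounds
    volume: int    # loudness on a hypothetical 0-9 scale


def encode(frames):
    """Turn a list of frames into one line of plain ASCII text."""
    return " ".join(f"{f.phoneme}:{f.pitch_hz}:{f.volume}" for f in frames)


def decode(line):
    """Rebuild the list of frames from a line produced by encode()."""
    frames = []
    for token in line.split():
        phoneme, pitch, volume = token.split(":")
        frames.append(Frame(phoneme, int(pitch), int(volume)))
    return frames


# "hello" as four phonemes with invented pitch and volume values
hello = [Frame("HH", 0, 4), Frame("AH", 120, 6),
         Frame("L", 118, 6), Frame("OW", 105, 7)]
line = encode(hello)
print(line)
```

Because the result is ordinary text, it could travel through a chat server
the same way typed messages do.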

Each person has their own voice-texture -- a simple list of volumes at
about a dozen frequencies -- obtained by running an FFT (Fast Fourier
Transform) on their voice. There are at least 3 ways that the voice texture
could be sampled:
        * it could be captured beforehand explicitly during a certain
          sound (probably "uh") on a tiny window of voice sound about 0.05
          of a second long.
        * it could be resampled perhaps from time to time during a session
          so that a gradual shift in voice would be detectable. This
          dynamic voice-texture resampling is much more difficult of course
          because you need to work out when a useful sound is being said.
        * a task could run in the background continually monitoring the
          voice frequencies and averaging them.
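
To make the first option concrete, here is a small Python sketch of
measuring a voice-texture from one 0.05 second window. It is only a sketch
under assumptions: the 8000 Hz sample rate and the particular dozen
frequencies in BANDS_HZ are my own picks, and instead of a full FFT it
evaluates the Fourier sum directly at just those twelve frequencies, which
is simpler to show and gives the same numbers for those bins.

```python
import cmath
import math

SAMPLE_RATE = 8000      # samples per second (assumed)
WINDOW_SECONDS = 0.05   # the tiny window mentioned above
# A dozen analysis frequencies in Hz (hypothetical choice)
BANDS_HZ = [200, 400, 600, 800, 1000, 1300,
            1600, 2000, 2400, 2800, 3200, 3600]


def band_magnitude(samples, freq_hz):
    """Strength of one frequency in the window (a single DFT term)."""
    total = sum(s * cmath.exp(-2j * math.pi * freq_hz * i / SAMPLE_RATE)
                for i, s in enumerate(samples))
    return abs(total) / len(samples)


def voice_texture(samples):
    """Return the list of volumes, one per frequency in BANDS_HZ."""
    return [band_magnitude(samples, f) for f in BANDS_HZ]


# Stand-in for a recorded "uh": two pure tones at 400 Hz and 1600 Hz
n = round(SAMPLE_RATE * WINDOW_SECONDS)
samples = [math.sin(2 * math.pi * 400 * i / SAMPLE_RATE)
           + 0.5 * math.sin(2 * math.pi * 1600 * i / SAMPLE_RATE)
           for i in range(n)]
texture = voice_texture(samples)
print([round(v, 3) for v in texture])
```

With this scaling a pure tone of amplitude A shows up as A/2 in its band,
so the test signal gives 0.5 at 400 Hz, 0.25 at 1600 Hz and practically
nothing elsewhere. A real voice would of course fill in all the bands.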

This last solution avoids the problems of a once-off training time, and of
having the computer choose suitable sounds from the stream, but it would
also bleed cycles from the CPU... cycles that are needed by 3D processing.
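
One cheap way the background averaging might be done is an exponential
moving average, sketched below in Python. This is my own choice of
averaging method for illustration, and the function name and the weight
value are assumptions. Each update costs only one multiply-add per
frequency, so the averaging itself is light; the frequency analysis that
feeds it is where the cycles would go.

```python
def update_average(average, new_texture, weight=0.1):
    """Blend a freshly measured texture into the running average.

    average     -- current running average, or None before the first sample
    new_texture -- list of volumes just measured, one per frequency
    weight      -- how much the new measurement counts (0..1, assumed value)
    """
    if average is None:
        return list(new_texture)
    return [(1 - weight) * old + weight * new
            for old, new in zip(average, new_texture)]


# Two-frequency toy example
avg = update_average(None, [1.0, 0.0])
avg = update_average(avg, [0.0, 1.0], weight=0.5)
print(avg)
```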

The receiving machine maps the voice texture onto the phoneme stream, and
modifies it by the pitch and volume info. 
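
A very simplified Python sketch of the receiving side follows. It is not
a real speech synthesizer: it ignores which phoneme is being spoken and
just builds a tone from harmonics of the pitch, with each harmonic's
loudness taken from the nearest entry in the voice-texture and scaled by
the volume. The sample rate, the frequency list, the 0-9 volume scale and
the 0.08 second frame length are all assumptions for the example.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
# The dozen texture frequencies in Hz (hypothetical choice)
BANDS_HZ = [200, 400, 600, 800, 1000, 1300,
            1600, 2000, 2400, 2800, 3200, 3600]


def nearest_band_gain(freq_hz, texture):
    """Look up the texture volume for the band closest to freq_hz."""
    idx = min(range(len(BANDS_HZ)), key=lambda i: abs(BANDS_HZ[i] - freq_hz))
    return texture[idx]


def render_frame(pitch_hz, volume, texture, seconds=0.08):
    """Return audio samples for one frame of the stream.

    volume is on a hypothetical 0-9 scale; pitch_hz <= 0 means unvoiced,
    which this sketch simply renders as silence.
    """
    n = round(SAMPLE_RATE * seconds)
    out = [0.0] * n
    if pitch_hz <= 0:
        return out
    harmonic = pitch_hz
    while harmonic < SAMPLE_RATE / 2:  # stay below the Nyquist limit
        gain = nearest_band_gain(harmonic, texture) * volume / 9
        for i in range(n):
            out[i] += gain * math.sin(2 * math.pi * harmonic * i / SAMPLE_RATE)
        harmonic += pitch_hz
    return out


flat_texture = [1.0] * len(BANDS_HZ)
frame = render_frame(100, 9, flat_texture)
print(len(frame))
```

A usable version would also need per-phoneme shaping and a noise source
for sounds like "s" and "f", which this leaves out.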

One of the nice things about this is that it would even preserve different
styles of speech (for instance you would hear me talk in my Australian
accent), and untranslatable things like a yawn or a frustrated moan would
come thru somewhat intact. There might be some problems with things like
laughter where the quality of your voice can change suddenly, but I am sure
there are simple ways to cope with this... like storing more than one
voice-texture for a person and being able to switch back and forth between
them. (Normal, whisper, shout, laugh, etc.)

This system would give very realistic voice transmission using extremely
low bandwidth. It would also be simple to store and ways could even be
found to search it fairly reliably.
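
To put rough numbers on the bandwidth and to show one way a stored stream
might be searched, here is a short Python sketch. The speaking rate of 12
phonemes per second, the 10 bytes per token, the PHONEME:pitch:volume
token layout and the function name find_phonemes are all assumptions for
illustration. The comparison figure is plain 8 kHz, 8-bit audio.

```python
# Uncompressed telephone-style audio: 8000 samples/s at 8 bits each
PCM_BITS_PER_SECOND = 8000 * 8

PHONEMES_PER_SECOND = 12  # assumed speaking rate
BYTES_PER_TOKEN = 10      # assumed size, e.g. "AH:120:6 " plus a little slack
stream_bits_per_second = PHONEMES_PER_SECOND * BYTES_PER_TOKEN * 8


def find_phonemes(line, query):
    """Return the token positions where the phoneme sequence `query` starts.

    Pitch and volume are ignored, so the same words match however they
    were spoken.
    """
    phonemes = [token.split(":")[0] for token in line.split()]
    width = len(query)
    return [i for i in range(len(phonemes) - width + 1)
            if phonemes[i:i + width] == query]


stored = "HH:0:4 AH:120:6 L:118:6 OW:105:7 W:110:6 ER:112:6 L:108:5 D:0:5"
hits = find_phonemes(stored, ["L", "OW"])
print(PCM_BITS_PER_SECOND, stream_bits_per_second, hits)
```

Under those assumptions the phoneme stream needs 960 bits per second
against 64000 for the raw audio, and the search is ordinary list matching
rather than anything audio-specific.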

This uses computers for what they are very good at (high speed encoding)
and avoids the stuff that they are terrible at (contextual interpretation
of meaning).

Bear in mind that voice should not be used to replace text -- just to
supplement it. There are certain things which are much better done via text
than voice, just as the reverse is also true.

More later.

Comments anyone?

Cheers,

        - Miriam

-----------------------------------------------------------------
http://werple.net.au/~miriam/

Virtual Reality Association (VRA)
Melbourne, Australia
http://www.vr.org.au/
