It would probably help your understanding if you were to read the
Monet manual. You wrote (see below):
But I have no idea how Monet
reproduces consonants. There are examples, but no trm files for them.
The .trm files are associated strictly with the tube model ("trm" =
"tube resonance model") and are saved and used by the "Synthesiser"
application -- a GUI application for playing with the tube, but only
with steady-state configurations. (You should probably read that
manual as well.) Consonants are mostly created by the dynamics of
vocal-tract change, though there are some continuant sounds, such as
fricatives (e.g. /s/); even for these, transitional cues are
important. Thus it is impossible to create consonants from .trm
files alone. They were really only useful in exploring the vocal
tract configurations needed to create the "postures" that serve as
anchor points (loosely related to "phones") for the varying speech
parameters.
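For concreteness, such a steady-state configuration amounts to a
fixed set of tube parameters, roughly as sketched below (the field
names are illustrative inventions, not the actual .trm file layout):

    /* Illustrative only: a quasi-steady-state vocal-tract "posture".
       Field names are invented; this is not the real .trm layout.  */
    #define NUM_REGIONS 8              /* distinctive regions R1..R8 */

    typedef struct {
        double glottal_pitch;          /* baseline pitch offset      */
        double glottal_volume;         /* voicing source amplitude   */
        double frication_volume;      /* noise source amplitude     */
        double radius[NUM_REGIONS];    /* region radii of the tube   */
        double velum;                  /* nasal coupling aperture    */
    } Posture;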
The dynamic information needed for complete speech is created from
these quasi-steady-state values representing vocal tract postures,
plus context-sensitive rules for moving from posture to posture,
according to timing information that reflects the rhythmic character
of British English. This information is all held within
"diphones.monet" (the rules are actually more complex than diphones
in many cases and include triphones and even tetraphones). Monet has
the algorithms to use this information appropriately.
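As a rough sketch of what moving from posture to posture means
computationally, here is a plain linear cross-fade between two such
postures (Monet's actual rules shape each parameter's path with
context-sensitive transition profiles and timings, not a bare lerp):

    /* Sketch only: cross-fade the Posture above, t runs 0..1. */
    void interpolate(const Posture *a, const Posture *b, double t,
                     Posture *out)
    {
        out->glottal_pitch    = a->glottal_pitch
                              + t * (b->glottal_pitch - a->glottal_pitch);
        out->glottal_volume   = a->glottal_volume
                              + t * (b->glottal_volume - a->glottal_volume);
        out->frication_volume = a->frication_volume
                              + t * (b->frication_volume - a->frication_volume);
        out->velum            = a->velum + t * (b->velum - a->velum);
        for (int i = 0; i < NUM_REGIONS; i++)
            out->radius[i] = a->radius[i]
                           + t * (b->radius[i] - a->radius[i]);
    }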
The intonation is applied to the varying stream of tube parameters
generated on this basis, according to a model of British English
intonation based on work by M.A.K. Halliday and elaborated by our
own studies, by varying the pitch (F0) parameter. These variations
are added to small pitch changes created at the posture (segmental)
level by constrictions in the vocal tract -- so-called
"micro-intonation" -- which provide additional cues for the
identification of consonants. Many of the relevant papers are
available on my university web site.
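Per frame, the two levels simply combine, along these lines (the
contour shape and numbers below are invented for illustration; the
Halliday-based tone-group model is far richer than a declining line
with a single accent):

    /* Sketch: utterance-level (macro) contour plus the posture-level
       micro-intonation perturbation.  Shapes/numbers are invented.  */
    double macro_contour(double t, double dur)    /* times in seconds */
    {
        double x = t / dur;
        double decline = 2.0 - 4.0 * x;          /* gentle overall fall */
        double accent  = (x > 0.3 && x < 0.5) ? 3.0 : 0.0;
        return decline + accent;                 /* semitones */
    }

    double frame_f0(double t, double dur, double micro_semitones)
    {
        return macro_contour(t, dur) + micro_semitones;
    }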
The "oi" sound is just a succession of vowel sounds with a varying
pitch, so a series of what appear to be .trm values will work. To
produce speech, you need to be able to construct a more complex set
of varying parameters reflecting the reality of speech. This is what
Monet does, and it is the part of Monet that needs to be extracted if
all you wish to do is convert sound specifications to a speech
waveform specification. The current Monet does much more, since it
allows you to create the databases as well as listen to the speech
that can then be produced. The extracted part (non-interactive) that
would simply use the databases to convert streams of posture symbols
to an output waveform is what we call "Real-time Monet". It has not
been ported from the original NeXT implementation yet.
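Schematically, that non-interactive part reduces to a loop like the
following (every type and function name here is a hypothetical
placeholder standing in for the real Monet/TRM machinery, reusing
the sketches above):

    /* Hypothetical placeholders for what Real-time Monet would need: */
    typedef struct Rule Rule;                   /* di/tri/tetraphone rule */
    Posture lookup_posture(const char *symbol); /* target from database   */
    const Rule *match_rule(const char **syms, int n, int i);
    double rule_duration(const Rule *r);        /* rhythm model           */
    double rule_profile(const Rule *r, double x); /* transition shape 0..1 */
    double micro_intonation(const Posture *p);
    void trm_render_frame(const Posture *p);    /* tube model -> samples  */

    #define FRAME_PERIOD 0.004    /* assumed 250 Hz control rate */

    void synthesize(const char **symbols, int n, double utterance_dur)
    {
        double now = 0.0;
        for (int i = 0; i + 1 < n; i++) {
            Posture from = lookup_posture(symbols[i]);
            Posture to   = lookup_posture(symbols[i + 1]);
            const Rule *rule = match_rule(symbols, n, i);
            double dur = rule_duration(rule);
            for (double t = 0.0; t < dur; t += FRAME_PERIOD) {
                Posture f;
                interpolate(&from, &to, rule_profile(rule, t / dur), &f);
                f.glottal_pitch += frame_f0(now, utterance_dur,
                                            micro_intonation(&f));
                trm_render_frame(&f);
                now += FRAME_PERIOD;
            }
        }
    }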
david
On Feb 11, 2007, at 1:06 PM, Nickolay V. Shmyrev wrote:
On Sat, 10/02/2007 at 15:53 -0800, David Hill wrote:
I have tried accessing the samples you provided. Only one of them
loaded and played. It did not sound anything like speech. The TRM is
simply the waveguide model of an acoustic tube, with control regions
applied according to the Distinctive Region Model developed by Carré,
based on earlier work by Fant.
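(For intuition: the heart of such a waveguide model is a chain of
scattering junctions between cylindrical tube sections. The textbook
Kelly-Lochbaum form is sketched below -- the general technique, not
the actual TRM code. A full tube model chains many such junctions
with delay lines between them and adds losses, a nasal side branch,
and the voicing and noise sources.)

    /* Textbook Kelly-Lochbaum junction between two tube sections with
       cross-sectional areas a1 (left) and a2 (right), pressure waves. */
    typedef struct { double k; } Junction;

    void junction_set_areas(Junction *j, double a1, double a2)
    {
        j->k = (a1 - a2) / (a1 + a2);   /* reflection coefficient */
    }

    /* f_in: rightward wave arriving from the left section,
       b_in: leftward wave arriving from the right section.  */
    void junction_scatter(const Junction *j, double f_in, double b_in,
                          double *f_out, double *b_out)
    {
        *f_out = (1.0 + j->k) * f_in - j->k * b_in;
        *b_out = j->k * f_in + (1.0 - j->k) * b_in;
    }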
The underlying theory is outlined in the paper "Real-time
articulatory speech-synthesis-by-rules" on my university web site,
and referenced from the gnuspeech project site (see below for the
university web site URL). Manuals for "Synthesiser" and "Monet" also
appear on that web site, towards the end of section E of the
published-papers page. In the Monet manual there is a table showing
the equivalences between the IPA symbols and the Monet symbols. This
should allow you to translate into the Festival set.
Ok, thanks, I'll do that.
Monet is an interactive tool for developing data sets for arbitrary
languages. Real-time Monet (which has not yet been ported) is the
heart of a daemon that uses these data sets to convert text to
speech.
It is a stripped down version of Monet and it would be really nice if
someone would take on that task (please ;-). Without the data sets,
and the algorithms for manipulating the parameter tracks, you don't
have a speech synthesiser, you have a rather specialised trumpet!
Well, I can do that. I just need more explanation. Is it something
Steve split out in the Framework dir? Currently Monet compiles fine;
only the gorm files are missing. I don't think sound output is
required, btw; it's enough to be able to save an audio file.
The data sets for synthesis in "diphones.monet" were developed on
the basis of several years of research in which British English
speech was analysed for sound data, rhythmic (duration) data, and
intonation data. This research is reported in other papers on the
site.
Btw, have you heard about the MOCHA database?
http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html
It seems that Alan has already used it in unit-selection synthesis.
It's not free, I suppose, which is why that work isn't available
yet. If it were possible to generate a set of prompts (around 1000
would be enough, I suppose) with Monet and later process the
coefficients with unit selection, that would be an interesting
thing.
If you would like to hear some samples of gnuspeech, go to my
university web site:
Yeah, I've downloaded them, but the problem is that I can only
reproduce vowels, like the "oi" example you've sent. I have no idea
how Monet reproduces consonants. There are examples, but no trm
files for them. And the examples I do have (for instance the one
Steve kindly sent to me) sound like a trumpet, as you've noticed :)
That's why I suspect there is a bug in the trm that makes consonant
generation impossible.
_______________________________________________
gnuspeech-contact mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/gnuspeech-contact