I have tried accessing the samples you provided. Only one of them
loaded and played. It did not sound anything like speech. The TRM
is simply the waveguide model of an acoustic tube, with control
regions applied according to the Distinctive Region Model developed
by Carré, based on earlier work by Fant. The underlying theory is
outlined in the paper "Real-time articulatory speech-synthesis-by-
rules" on my university web site and referenced from the gnuspeech
project site (see below for the university web site URL). Manuals
for "Synthesiser" and "Monet" also appear on that web site, towards
the end of section E of the published papers page. In the Monet
manual there is a table showing the equivalences between IPA symbols
and the Monet symbols. This should allow you to translate into the
Festival set.
Monet is an interactive tool for developing data sets for arbitrary
languages. Real-time Monet (which has not yet been ported) is the
heart of a daemon that uses these data sets to convert text to
speech. It is a stripped-down version of Monet, and it would be
really nice if someone would take on the task of porting it
(please ;-). Without the data sets, and the algorithms for
manipulating the parameter tracks, you don't have a speech
synthesiser, you have a rather specialised trumpet!
The data sets for synthesis in "diphones.monet" were developed from
several years of research in which British English speech was
analysed for sound data, rhythmic (duration) data, and intonation
data. This research is reported in other papers on the site.
If you would like to hear some samples of gnuspeech, go to my
university web site:
http://www.cpsc.ucalgary.ca/~hill
Click on "Published papers" in the left menu, then select the first
paper in section "B. National and international invited
contributions ..." (it is the one referred to above).
At the bottom of the left side menu in the resulting page you will
find a whole bunch of examples of gnuspeech synthesis, some short,
some long.
The tube resonance model parameters are specified in the source code
for the TRM. I attach a sample set of parameters: 24 fixed
(utterance-rate) parameters, followed by 6 blocks of 16 values that
drive the so-called "speech-rate" parameters. The utterance
represented is "oi", lasting about 1.5 seconds (the input control
rate is 4 Hz and there are 6 blocks, so 6 / 4 = 1.5 s).
The 16 speech-rate parameters are, in order:
GlottalPitch, GlottalVolume, AspirationVolume, FricativeVolume,
FricativePosition, FricativeCentreFrequency,
FricativeBandWidth, radius1, radius2, radius3, radius4, radius5,
radius6, radius7, radius8, velumRadius
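For anyone who wants to feed such a parameter set to other tools, here
is a minimal Python sketch of a reader for the layout described above.
It is not taken from the TRM source; the function names
(parse_trm_parameters, duration_seconds) are made up for illustration.
It assumes the input contains only the numeric lines, that anything
after a ";" is a comment, that the first 24 values are the fixed
parameters with the input control rate first, and that the remaining
values come in groups of 16 in the order listed above (a group may wrap
over more than one line, as in the attached sample).

```python
from typing import Dict, List, Tuple

# Order of the 16 speech-rate parameters, as listed in the message.
SPEECH_RATE_NAMES = [
    "GlottalPitch", "GlottalVolume", "AspirationVolume", "FricativeVolume",
    "FricativePosition", "FricativeCentreFrequency", "FricativeBandWidth",
    "radius1", "radius2", "radius3", "radius4", "radius5",
    "radius6", "radius7", "radius8", "velumRadius",
]
N_FIXED = 24  # number of fixed (utterance-rate) parameters


def parse_trm_parameters(text: str) -> Tuple[List[float], List[Dict[str, float]]]:
    """Split a parameter set into fixed values and speech-rate blocks."""
    values: List[float] = []
    for line in text.splitlines():
        # Assumption: everything after ';' is a comment.
        data = line.split(";", 1)[0].strip()
        if data:
            values.extend(float(tok) for tok in data.split())

    fixed, rest = values[:N_FIXED], values[N_FIXED:]
    block = len(SPEECH_RATE_NAMES)
    if len(fixed) < N_FIXED or len(rest) % block != 0:
        raise ValueError("unexpected number of values in parameter set")

    # Group the remaining values 16 at a time, one dict per block.
    frames = [
        dict(zip(SPEECH_RATE_NAMES, rest[i:i + block]))
        for i in range(0, len(rest), block)
    ]
    return fixed, frames


def duration_seconds(fixed: List[float], frames: List[Dict[str, float]]) -> float:
    """Approximate duration: number of blocks / input control rate."""
    control_rate_hz = fixed[0]  # assumption: control rate is the first value
    return len(frames) / control_rate_hz
```

Applied to the attached "oi" set, this should give 24 fixed values and
6 blocks, and 6 blocks at 4 Hz works out to the 1.5 seconds mentioned
above.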
I hope this helps.
--------
David Hill
Simplicity, patience, compassion. These three are your greatest
treasures (Tao Te Ching #67)
---------
On Feb 8, 2007, at 9:20 AM, Nickolay V. Shmyrev wrote:
Heh, since it seems it would be hard to build Monet on Linux I've
tried to adopt trm to work with festival. Actually for me it seems it
would be interesting work, since festival predicts intonation and
duration much more precisely and is able to produce very good
annotations.
----------
Parameter set for TRM "oi" (1.5 seconds, falling pitch during diphthong)
------------------------------------------------------------------------
4 ; input control rate (1 - 1000 Hz)
60.0 ; master volume (0 - 60 dB)
1 ; number of sound output channels (1 or 2)
0.0 ; stereo balance (-1 to +1)
0 ; glottal source waveform type (0 = pulse, 1 = sine)
40.0 ; glottal pulse rise time (5 - 50 % of GP period)
22.0 ; glottal pulse fall time minimum (5 - 50 % of GP period)
45.0 ; glottal pulse fall time maximum (5 - 50 % of GP period)
2.50 ; glottal source breathiness (0 - 10 % of GS amplitude)
10.0 ; nominal tube length (10 - 20 cm)
32 ; tube temperature (25 - 40 degrees celsius)
1.00 ; junction loss factor (0 - 5 % of unity gain)
3.05 ; aperture scaling radius (3.05 - 12 cm)
0.75 ; mouth aperture coefficient (0 - 0.99)
0.72 ; nose aperture coefficient (0 - 0.99)
1.35 ; radius of nose section 1 (0 - 3 cm)
1.96 ; radius of nose section 2 (0 - 3 cm)
1.91 ; radius of nose section 3 (0 - 3 cm)
1.3 ; radius of nose section 4 (0 - 3 cm)
0.73 ; radius of nose section 5 (0 - 3 cm)
1500.0 ; throat lowpass frequency cutoff (50 - nyquist Hz)
6.0 ; throat volume (0 - 48 dB)
1 ; pulse modulation of noise (0 = off, 1 = on)
48.0 ; noise crossmix offset (30 - 60 dB)
10.0 0.0 0.0 0.0 4.0 4400 600 0.8
0.8 0.4 0.4 1.78 1.78 1.26 0.8 0.0
9.5 54.0 0.0 0.0 4.0 4400 600 0.8
0.8 0.4 0.4 1.78 1.78 1.26 0.8 0.0
9.0 60.0 0.0 0.0 4.0 4400 600 0.8
0.8 0.6 0.6 1.58 1.58 1.13 1.01 0.1
8.5 60.0 0.0 0.0 4.0 4450 550 0.8
0.8 1.28 1.28 1.0 1.0 1.0 0.8 1.0
8.0 54.0 0.0 0.0 4.0 4500 500 0.8
0.8 1.68 1.58 0.8 0.8 0.5 0.4 1.0
7.0 51.0 0.0 0.0 4.0 4500 500 0.8
0.8 1.78 1.78 0.2 0.2 0.4 0.0 1.0
----------