Re: Making a Vintage Sounding TTS Voice

I would've liked to respond to this sooner, but I was having issues logging into the site. Thankfully that's sorted out now.

Your research into this is really wild to me. I'm no expert in these things, but I am really interested in this stuff. I only wish I could code so that I could either try to build something of my own or contribute to yours. :)

One thing I meant to say earlier is that, at least in the examples on your site, some of the transitions are a little messy, as you said yourself. The one that particularly stuck out to me was the transition from voiced to unvoiced. It sounded too long to me, as though the synth were drunk. Lol. Of course the lengths of these transitions could change with different speaking rates, but as a baseline value it sounds too long in proportion to the vowels. Not sure if your most recent question ties into that, but I thought I'd mention it anyway.

As to the transition between phones, if I'm understanding you correctly, you want to know whether a static transition time between any two phones is most suitable? My layman's understanding of synthesis tells me that it would be fine, at least as a starting point, to use the same transition time between consonants and vowels. With diphthongs, or two adjacent vowels, I'm not so sure; that's something I'd have to give more thought.

A long long time ago, when I was a young teenager with nothing better to do, I did play around with this stuff in Dectalk, and I also had an old program called Flex Voice (which btw if anyone has Flex Voice, please get in touch). I wish I'd kept my singing synth stuff around but sadly I either lost most of it or deleted it.

In both synths, when singing or doing complex manual phoneme work, you specify phonemes and durations. The phonemes are exactly like those in the phone list in your post; the codes weren't the same, obviously, but it's the same principle. You even specified duration in milliseconds, which is the main reason I never continued my singing efforts: syncing it with music would be difficult, not to mention there is at least one bug I know of in Dectalk that messes with the millisecond values of certain phonemes. Anyway, I found that setting consonants like b, d, f, g, h, k etc. to 60 milliseconds generally worked well, if I remember right, and I don't remember having to change them. For words like grass, which have multiple consonants at the beginning, I can't remember what I did. I also don't remember, in a musical setting, whether I wanted to put the consonant on the beat or the vowel on the beat. Something tells me the latter is more appropriate. For your speech synth efforts that obviously doesn't matter, since the transition will happen in either case, but it's an interesting question when you're trying to do singing.
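For what it's worth, the millisecond math that made syncing with music annoying isn't complicated in itself; here's a little Python sketch of it. The 60 ms consonant value is the figure I remember from my Dectalk tinkering, and putting the vowel on the beat is just my hunch, not anything official.

```python
def note_ms(bpm: float, beats: float = 1.0) -> float:
    """Length of a note in milliseconds at a given tempo.

    beats is the note length in quarter notes
    (1.0 = quarter note, 0.5 = eighth, etc.).
    """
    return beats * 60000.0 / bpm

# The rough consonant duration that worked for me in Dectalk (from memory).
CONSONANT_MS = 60

def split_beat(bpm: float, beats: float = 1.0, consonant_ms: float = CONSONANT_MS):
    """Return (consonant_ms, vowel_ms) so the pair fills one note,
    putting the vowel on the beat by starting the consonant early."""
    total = note_ms(bpm, beats)
    return consonant_ms, max(total - consonant_ms, 0.0)
```

So at 120 BPM a quarter note is 500 ms, and a 60 ms consonant leaves 440 ms for the vowel; the fiddly part in practice was typing all those numbers by hand and fighting the duration bug.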

As for diphthong transitions, they were always controlled by the synthesizer, which for a geek like me does get a little annoying when it transitions in a way you don't particularly want in the context of what you're doing. For example, let's say I want to make the word ice with a long duration. Some synths will stretch the transition from ah to ih. The stretch is proportional, meaning that if you make it 5 seconds long, you'll get a very slow transition. If I remember right, Dectalk has a maximum length for this transition and will just hang on the ih sound at the end for the rest of the phoneme. I suspect other synths do the inverse, that is, prolong the ah and do the transition at the end, which sounds somewhat comical. As to which one is best, I can't really say, especially for normal speech.
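Those two stretching behaviours can be written down as simple interpolation rules, if it helps to see them side by side. This is just a sketch; the formant targets and the 150 ms transition cap are made-up illustrative numbers, not anyone's actual synth settings.

```python
def proportional(t_ms, total_ms, start_hz, end_hz):
    """Stretch the glide over the whole phoneme: a 5-second 'ice'
    gives a very slow ah-to-ih transition."""
    frac = min(max(t_ms / total_ms, 0.0), 1.0)
    return start_hz + (end_hz - start_hz) * frac

def capped(t_ms, total_ms, start_hz, end_hz, cap_ms=150.0):
    """Dectalk-like behaviour (as I remember it): the glide takes at
    most cap_ms, then the synth hangs on the final target."""
    trans = min(total_ms, cap_ms)
    if t_ms >= trans:
        return end_hz
    return start_hz + (end_hz - start_hz) * (t_ms / trans)
```

With a hypothetical first formant gliding from about 700 Hz (ah) to 400 Hz (ih) over 5 seconds, the proportional rule is still mid-glide at 2.5 seconds, while the capped rule has already parked on the ih target after 150 ms.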

All of this reflection on speech synths reminds me of my own efforts to explore the concept, but being the techy/musical person I am, I like putting things in a musical context, as you can probably tell from this post.

I don't know if you've ever heard of a machine called the Voder. The Voder is an old machine from the late 1930s, iirc. It was never used commercially but was more an educational contraption, built to show that with the cutting-edge technology of the time it was possible to synthesize speech. I think the research that went into synthesizing speech with the Voder also helped in building the vocoder concept. Anyway, the Voder supposedly had a complex control board for adjusting analog circuitry (filters, oscillators etc.) to make speech sounds. Of course it wasn't an automatic TTS. To make any sort of speech you had to learn precisely how to move the controls to produce different phonemes, and mastering that required many months of training.

One day I was exceptionally bored and started a script for a musical instrument player called Sforzando, with the goal of making a singing synth whose usage is roughly the same idea as the Voder's. To make things complicated, each formant filter's frequency is assigned to a different control knob, mainly because in Sforzando there is almost no way I could do complex transitions automatically, so it has to be done manually. As for unvoiced sounds, I haven't at all decided how to tackle that. So this thing would never be useful to anyone unless they wanted to experiment, but it's cool that we're approaching speech synthesis from totally different angles. I'm already learning a lot from reading about your efforts. The difference between parallel and cascade filter setups is really interesting to me, as is your take on synthesizing unvoiced consonants. I fully relate to how tricky it can be to EQ noise in such a way as to create consonants, and I'm wondering how you did it. I'll be especially interested to see how you make f and th distinguishable; even the better formant synths only make valiant attempts (at least Eloquence does, and that's the one whose sound I know best).
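Since parallel vs. cascade came up, here's a toy Python sketch of what I understand the difference to be, using basic two-pole resonators: in a parallel setup each formant filter gets the source independently and the outputs are summed, while a cascade chains them in series. The center frequencies and bandwidths below are placeholder numbers, and this is my layman's sketch, not a claim about how your synth does it.

```python
import math

def resonator(signal, f_hz, bw_hz, fs=16000):
    """Two-pole digital resonator: a basic formant filter.

    Standard recipe: pole radius from the bandwidth, pole angle
    from the center frequency, input gain chosen for unity gain at DC.
    """
    r = math.exp(-math.pi * bw_hz / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * f_hz / fs)
    a2 = -r * r
    b0 = 1.0 - a1 - a2
    y, y1, y2 = [], 0.0, 0.0
    for x in signal:
        out = b0 * x + a1 * y1 + a2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y

def parallel(source, formants):
    """Run the source through each resonator separately, then sum."""
    branches = [resonator(source, f, bw) for f, bw in formants]
    return [sum(samples) for samples in zip(*branches)]

def cascade(source, formants):
    """Feed the source through the resonators in series."""
    for f, bw in formants:
        source = resonator(source, f, bw)
    return source
```

Feeding both an impulse with two made-up formants, say (500 Hz, 60 Hz bandwidth) and (1500 Hz, 90 Hz bandwidth), gives two different spectra from the same filters, which is why (as I understand it) the two topologies need different per-formant gain handling.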

Well that's enough rambling from me. Hope it was at least a fun read anyway.

-- 
Audiogames-reflector mailing list
Audiogames-reflector@sabahattin-gucukoglu.com
https://sabahattin-gucukoglu.com/cgi-bin/mailman/listinfo/audiogames-reflector
