On Thursday, 5 May 2016 at 23:47:15 UTC, H. S. Teoh wrote:
Rule-based letter-to-sound systems don't work too well for
English precisely because you have to basically reproduce 500
years' worth of sound change plus all the exceptions introduced
by words borrowed from other contemporaneous languages across the
centuries. A rule-based system possibly could work, provided
the rules were extensive enough (and multi-layered, to account
for borrowed exceptions and other oddities). But there comes a
point where even the most industrious programmer would throw up
his hands and say, forget this exercise in futility, let's just
have the machine teach itself instead.
It's not just sound changes; English is just weird from a
non-native speaker's point of view. As Kurt Tucholsky, one of the
best German writers ever, once said, English is a simple and a
difficult language at the same time. It consists of foreign words
that are pronounced wrongly. English pronunciation makes any
speaker of a Latin language cringe. In many European languages,
and certainly in Latin languages, the letter-to-sound
correspondence is more or less one-to-one: <a> is /a/, <e> is /e/
etc. In English they're often /ei/ and /i:/, and <i> is often /ai/ (oh,
for f**k's sake!): "emeritus", a Latin word, is pronounced
/e.'me(:).ri.tus/ in Latin; in English it's /em@.'rai.d@s/. This just
makes you cringe. Native speakers of English often don't realize
how weird their pronunciation sounds to those who natively speak
the language they borrowed the words from (around 60% of the
words). Makes me laugh when I hear English speakers who say "Oh,
there is no Irish word for 'afterhours'!?" - Well, what's the
English for "restaurant", "evict", "condone", "depot", "deposit"
... and what's the English for "language"?
Rule-based systems work better for Spanish because the
orthography is much closer to actual pronunciation, and other
parameters such as stress are more predictable. I'd venture to
guess that rule-based systems might not work as well for
Russian, in spite of the orthography being almost 1-to-1 with
actual pronunciation, because of unpredictable stress positions
which can fundamentally alter vowel values. At best, you'd need
a database of stress patterns for various words so that the
accent would fall in the correct places. Plus a set of
exceptions for certain archaic word combinations that have
unusual stress. If you had a database of English stress
positions, I think half the battle would already be won.
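
To make the "database of stress patterns" idea concrete, here is a minimal sketch in D. It assumes a hypothetical lexicon that maps each word to the index of its stressed vowel, and it models only one crude effect (an unstressed "о" is written out as "а"). Real Russian vowel reduction is considerably more nuanced. The names `reduceUnstressedO` and `pronounce` and the lexicon entries are invented for illustration.

```d
import std.algorithm.searching : canFind;
import std.stdio : writeln;

/// True for the ten Russian vowel letters (lower case only).
bool isVowel(dchar c)
{
    return "аеёиоуыэюя"d.canFind(c);
}

/// Crude illustration of vowel reduction: every "о" that is not the
/// stressed vowel is rewritten as "а". `stressedVowel` is the zero-based
/// index of the stressed vowel, counting vowels only.
dstring reduceUnstressedO(dstring word, int stressedVowel)
{
    dstring result;
    int vowelIndex = 0;
    foreach (dchar c; word)
    {
        if (isVowel(c))
        {
            result ~= (c == 'о' && vowelIndex != stressedVowel) ? 'а' : c;
            ++vowelIndex;
        }
        else
        {
            result ~= c;
        }
    }
    return result;
}

/// Looks the word up in a (hypothetical) stress lexicon. Words that are
/// missing from the lexicon are returned unchanged, because their stress
/// position cannot be guessed from the spelling.
dstring pronounce(dstring word, const int[dstring] lexicon)
{
    if (auto stress = word in lexicon)
        return reduceUnstressedO(word, *stress);
    return word;
}

void main()
{
    // Illustrative entries only, not taken from a real database.
    int[dstring] lexicon = ["молоко"d: 2, "город"d: 0];
    writeln(pronounce("молоко"d, lexicon));
    writeln(pronounce("город"d, lexicon));
}
```

The same spelling can come out differently depending on where the stress falls: `reduceUnstressedO("замок"d, 0)` and `reduceUnstressedO("замок"d, 1)` give different results, which is why the lexicon (and, for homographs, some context) is needed at all.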
French would have the same problem as English, except that you
could just do as a first approximation:
	if (rand() > someFactor)
		word = word[0 .. $/2];
and then touch it up with a small set of exceptions. :-P
T
Are Russian stress rules based on context? Long vs. short vowels,
palatalized vs. velarized consonants etc.? If yes, you can
program rules.
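
For comparison, Spanish is a case where stress really can be read off the written form, which is what "you can program rules" would look like in practice. The sketch below, in D, encodes the standard default: a written accent marks the stressed vowel; otherwise words ending in a vowel, "n" or "s" are stressed on the penultimate syllable and all others on the last. The enum and function names are invented for illustration, the input is assumed to be lower case, and monosyllables are not treated specially.

```d
import std.algorithm.searching : canFind;

/// Where the stress falls, as far as the spelling alone can tell.
enum Stress { marked, penultimate, last }

/// Default Spanish stress placement from orthography (lower-case input).
Stress spanishStress(dstring word)
{
    if (word.length == 0)
        return Stress.last; // arbitrary choice for empty input

    // A written accent always overrides the default rule.
    foreach (dchar c; word)
        if ("áéíóú"d.canFind(c))
            return Stress.marked;

    // Ending in a vowel, "n" or "s": penultimate syllable. Otherwise: last.
    immutable dchar finalLetter = word[$ - 1];
    return "aeiouns"d.canFind(finalLetter) ? Stress.penultimate : Stress.last;
}
```

With this rule, "casa" and "joven" come out as `Stress.penultimate`, "hablar" and "ciudad" as `Stress.last`, and "canción" as `Stress.marked`. Whether anything comparably compact exists for Russian is exactly the open question.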