--- In FairfieldLife@yahoogroups.com, Bhairitu <noozg...@...> wrote:
>
> As a software professional and developer/designer my first step
> would be to buy some time from linguistics experts. I would also
> brainstorm with other developers on solutions to the problem of
> language translation. Then you would want to do some modeling on
> several of the solutions and see what floats and what sinks. Some
> of the first translation software that I bought years ago came out
> of Russia. That was most likely based on research done at their
> universities and institutes. And yes I might well look into
> Sanskrit as an intermediate language too and most certainly track
> down what has already been done in that area.
First, I was just rappin' and trippin' on the idea, not actually proposing that I'd ever be interested in writing such software. :-)

Second, I did a little Googling and found that I was on the right track in at least two critical areas. The first was that there is a general consensus that the higher the degree of ambiguity in the language itself (the more ways a sentence can possibly be translated, given all the possible meanings and parts of speech the words in that sentence can have), the more monumental the task of translating that language is. The second is that such translation programs are, in fact, done in discrete stages, not as sheer number-crunching dictionary matches. Dictionary matching is seen as the least useful and practical method. A base knowledge of linguistics and natural language is considered essential.

"Intermediate languages" are used, but they are never an existing human language; none of them are precise and unambiguous enough to qualify. Instead, each is a made-up symbolic language into which the source human language is decoded, prior to encoding it back into other human languages. This tends to be referred to as the "interlingual" approach.

An assumption in my earlier rap that I didn't spell out was that any such system has to be heavily rule-based. That is agreed to by all the sources I found today. Dictionary-based translation (one-to-one word mapping) is considered the least successful approach, as I suspected. "Statistical" methods translate by comparing many side-by-side previous translations between the two languages. "Example-based" translation is similar to what I described as "parsing for idioms," and makes use of a large database of known phrases and word combinations.

The best software so far uses combinations of all these methods, not just one, but each tends to *rely* on one approach more than the others. For example, SYSTRAN (which underlies Yahoo's Babelfish) is primarily an example-based system, whereas Google Translate's engine is primarily statistical.

It is generally agreed that even the best commercial translation software rarely works well enough "out of the box." Instead, like optical character recognition software for scanning printed pages, it needs the ability to "learn" from past mistakes and correct similar mistakes in the future. Thus you translate, have a native speaker of the target language go in and make corrections to the output, send those back to the processing engine, and hopefully it will do better the next time, having "learned" from its own mistakes.

(A few rough toy code sketches illustrating these ideas follow below.)

And that's all I know about this arcane subject... :-)
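Just to make the ambiguity point concrete, here's a toy Python sketch. The mini-lexicon and its word readings are my own invention, not anything from the sources I found; the point is only that the number of candidate readings of a sentence multiplies out from the readings of its words, which is why ambiguous languages are so much harder to translate.

# Toy illustration of lexical ambiguity. The lexicon below is made up;
# each word maps to the readings a hypothetical dictionary might list.
from math import prod

LEXICON = {
    "time":  [("noun", "duration"), ("verb", "to clock/measure")],
    "flies": [("verb", "moves through the air"), ("noun", "the insects")],
    "like":  [("prep", "similar to"), ("verb", "to enjoy")],
    "an":    [("det", "article")],
    "arrow": [("noun", "projectile")],
}

def count_readings(sentence):
    """Number of candidate interpretations, if every combination of
    word readings were grammatically possible."""
    words = sentence.lower().split()
    return prod(len(LEXICON.get(w, [("unknown", "?")])) for w in words)

print(count_readings("Time flies like an arrow"))  # 2 * 2 * 2 * 1 * 1 = 8

Even in this tiny example, eight readings have to be narrowed down to one before you can translate anything, and a real lexicon lists far more readings per word.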
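Here's an equally loose sketch of the "interlingual" idea: decode the source sentence into a made-up symbolic representation, then let each target language have its own encoder from that representation. Everything in it (the frame fields, the "grammar," the word-order rule for the second target) is invented purely for illustration and is nothing like real German or Spanish grammar.

# Toy interlingua: decode English into a language-neutral frame, then
# encode that frame into two different targets. All rules are invented.
from dataclasses import dataclass

@dataclass
class Interlingua:
    predicate: str   # language-neutral event, e.g. "EATS"
    agent: str       # who does it
    patient: str     # what it is done to

def decode_english(sentence):
    # Wildly oversimplified: assumes the exact pattern "The X eats the Y."
    words = sentence.lower().replace(".", "").split()
    return Interlingua(predicate=words[2].upper(),
                       agent=words[1].upper(),
                       patient=words[4].upper())

SPANISH = {"CAT": "el gato", "FISH": "el pescado", "EATS": "come"}
GERMAN  = {"CAT": "die Katze", "FISH": "den Fisch", "EATS": "frisst"}

def encode(frame, lexicon, verb_last=False):
    # Each target encoder applies its own (toy) word-order rule
    if verb_last:
        parts = [lexicon[frame.agent], lexicon[frame.patient], lexicon[frame.predicate]]
    else:
        parts = [lexicon[frame.agent], lexicon[frame.predicate], lexicon[frame.patient]]
    return " ".join(parts)

frame = decode_english("The cat eats the fish.")
print(encode(frame, SPANISH))                  # el gato come el pescado
print(encode(frame, GERMAN, verb_last=True))   # die Katze den Fisch frisst

The attraction of the approach is that you only write one decoder per source language and one encoder per target language, instead of a separate system for every language pair.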
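And here's a toy contrast between plain dictionary matching and a crude "example-based" pass that checks a phrase database before falling back to word-by-word lookup. The English-to-Spanish entries are purely illustrative, not anyone's real engine.

# One-to-one word mapping -- the "least successful" method
WORD_DICT = {
    "it's": "está", "raining": "lloviendo", "cats": "gatos",
    "and": "y", "dogs": "perros",
}

# Known phrases / idioms, matched before individual words
PHRASE_DB = {
    "raining cats and dogs": "lloviendo a cántaros",
}

def dictionary_translate(sentence):
    return " ".join(WORD_DICT.get(w, w) for w in sentence.lower().split())

def example_based_translate(sentence):
    text = sentence.lower()
    # Substitute known phrases first, then fall back to word lookup
    for phrase, translation in PHRASE_DB.items():
        text = text.replace(phrase, translation)
    return " ".join(WORD_DICT.get(w, w) for w in text.split())

sentence = "It's raining cats and dogs"
print(dictionary_translate(sentence))     # está lloviendo gatos y perros -- word salad
print(example_based_translate(sentence))  # está lloviendo a cántaros -- idiom preserved

This is the "parsing for idioms" idea in miniature: the phrase database catches the combinations that word-by-word mapping mangles.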
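For the "statistical" approach, here's a toy take (again, invented) on learning translations from side-by-side examples: count which target word co-occurs most often with each source word across aligned sentence pairs. Real statistical systems are vastly more involved (alignment models, phrase tables, language models); this only shows the flavor.

# Tiny invented parallel corpus of (English, Spanish) sentence pairs
from collections import Counter, defaultdict

CORPUS = [
    ("the house", "la casa"),
    ("the cat",   "el gato"),
    ("a house",   "una casa"),
    ("the dog",   "el perro"),
]

cooccur = defaultdict(Counter)
for en, es in CORPUS:
    for e in en.split():
        for s in es.split():
            cooccur[e][s] += 1   # count every source/target word pairing

def best_guess(english_word):
    counts = cooccur.get(english_word)
    return counts.most_common(1)[0][0] if counts else english_word

print(best_guess("house"))   # casa -- co-occurs with "house" in both of its pairs
print(best_guess("the"))     # el -- most frequent in this tiny corpus, but ambiguous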
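Finally, a toy sketch (my invention, essentially just a translation memory) of the "learn from your corrections" loop: translate, let a native speaker fix the output, and store the corrected pair so the same source comes out right the next time.

class CorrectableTranslator:
    def __init__(self, word_dict):
        self.word_dict = word_dict
        self.memory = {}          # source sentence -> human-corrected output

    def translate(self, sentence):
        key = sentence.lower()
        if key in self.memory:    # reuse a known-good correction
            return self.memory[key]
        return " ".join(self.word_dict.get(w, w) for w in key.split())

    def correct(self, sentence, better_output):
        """Feed a native speaker's fix back into the engine."""
        self.memory[sentence.lower()] = better_output

mt = CorrectableTranslator({"good": "buenos", "morning": "mañana"})
print(mt.translate("Good morning"))         # buenos mañana -- wrong
mt.correct("Good morning", "buenos días")   # native speaker fixes it
print(mt.translate("Good morning"))         # buenos días -- "learned"

Real systems generalize from corrections instead of merely memorizing them, but the workflow is the same: translate, post-edit, feed the edits back in.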