On 24/04/13 12:35, Denny Vrandečić wrote:
Current machine translation research aims at using massive machine learning
supported systems. They usually require big parallel corpora. We do not
have big parallel corpora (Wikipedia articles are not translations of each
other, in general), especially not for many languages, and there is no

Could you define "big"? If 10% of Wikipedia articles are translations of each other, we have 2 million translation pairs. Assuming ten sentences per average article, that is 20 million sentence pairs. An average Wikipedia with 100,000 articles would have 10,000 translated articles and 100,000 sentence pairs; a large Wikipedia with 1,000,000 articles would have 100,000 translated articles and 1,000,000 sentence pairs. Is this not enough to kickstart a massive machine-learning-supported system? (Consider also that the articles are somewhat similar in structure and less rich than general text; future tense is rarely used, for example.)
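To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch. The 10% translation fraction, the ten sentences per article, and the roughly 20 million total articles it implies are assumptions carried over from the paragraph above, not measured figures:

    # Back-of-the-envelope estimate of parallel sentence pairs from Wikipedia.
    # All constants are assumptions from the discussion above, not measurements.

    TRANSLATION_FRACTION = 0.10   # assumed share of articles that are translations
    SENTENCES_PER_ARTICLE = 10    # assumed average article length in sentences

    def estimate(article_count):
        """Return (translated articles, parallel sentence pairs) for a wiki."""
        translated = int(article_count * TRANSLATION_FRACTION)
        return translated, translated * SENTENCES_PER_ARTICLE

    for name, articles in [("all Wikipedias (assumed ~20M articles)", 20_000_000),
                           ("average Wikipedia", 100_000),
                           ("large Wikipedia", 1_000_000)]:
        translated, pairs = estimate(articles)
        print(f"{name}: {translated:,} translated articles, {pairs:,} sentence pairs")

Running this reproduces the figures above: 2 million translated articles and 20 million sentence pairs overall, 100,000 sentence pairs for an average Wikipedia, and 1 million for a large one.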
