> How well is it working for low-resource langs?

We try to support all language pairs. I've tried Inuktitut-English and Hindi-Marathi, for example.
The main factors are:

1. How dirty your parallel corpus is. In that sense, low-resource languages are often easier: the relative ranking just needs to work (see the ranking sketch below).

2. How much data we have for the language. My own language (Alemannic) is *not* working well. It's not in Mozilla TMs, BERT or LASER, and it has no standard orthography. But a language like Armenian, with fewer speakers and a lower GDP, works better, because its Wikipedia is strong and its unique script makes it easy to identify (see the script sketch below). At this conference, I expect Oriya/Odia and Khmer will be the toughest.

3. How much data we have for the *pair*. We have seen Hindi-Marathi and Russian-Armenian working decently, but those are well-established pairs with a lot of cultural overlap (Sprachbund).

4. Your use case. Training a generic system from scratch on very large datasets is different from fine-tuning for a domain on small data. (For the former, you usually want strict 1:1ness, e.g. miles should not be converted to kilometres; see the number-check sketch below.) It won't work well out of the box if you're doing adversarial attacks or need it calibrated across language pairs.

5. Whether the low-resource language is the source or the target language. Just imagine a human doing this who only knows one of the two. There is an unknown-language option (*other UND*), so you can even try it on languages not in the dropdown. That works better when the unknown language is the source, not the target.

If you see issues or have data that can improve a language pair, let me know.
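On point 1, "the relative ranking just needs to work": the usual trick is to score each candidate sentence pair with multilingual embeddings and keep only the top of the ranking, so absolute scores can be noisy as long as good pairs outrank bad ones. A minimal sketch, assuming the third-party laserembeddings package (models fetched separately with `python -m laserembeddings download-models`); this is one common way to do it, not necessarily what our pipeline uses:

```python
# Sketch: rank candidate sentence pairs by LASER cosine similarity.
# Assumes: pip install laserembeddings
#          python -m laserembeddings download-models
import numpy as np
from laserembeddings import Laser

laser = Laser()

def rank_pairs(src_sents, tgt_sents, src_lang, tgt_lang):
    """Return (score, src, tgt) triples, best-aligned pairs first.
    src_sents[i] is the candidate translation of tgt_sents[i]."""
    src_emb = laser.embed_sentences(src_sents, lang=src_lang)
    tgt_emb = laser.embed_sentences(tgt_sents, lang=tgt_lang)
    # Normalise rows, then cosine similarity of each aligned pair.
    src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt_emb /= np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    scores = (src_emb * tgt_emb).sum(axis=1)
    order = np.argsort(-scores)  # descending
    return [(float(scores[i]), src_sents[i], tgt_sents[i]) for i in order]

# Keep, say, the top 80% of pairs; the exact cutoff matters much less
# than the ordering being roughly right.
```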
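To illustrate the script point under 2: a language like Armenian can be spotted with a trivial Unicode-range check, no trained model needed, whereas Alemannic shares the Latin script with German and cannot be. An illustrative sketch only (the block boundaries are from the Unicode standard), not how any particular identifier works:

```python
# Sketch: identify Armenian text by Unicode block alone.
# Armenian occupies U+0530-U+058F, plus ligatures at U+FB13-U+FB17.
def armenian_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Armenian blocks."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for c in letters
               if '\u0530' <= c <= '\u058f' or '\ufb13' <= c <= '\ufb17')
    return hits / len(letters)

print(armenian_ratio("Բարեւ ձեզ"))        # ~1.0 -> almost certainly Armenian
print(armenian_ratio("Grüezi mitenand"))  # 0.0 -> Latin script: Alemannic,
                                          # German, ... indistinguishable here
```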
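And for the strict 1:1ness under point 4: a cheap proxy is to require the numbers on both sides to match, which catches unit conversions like miles to kilometres, since the digits change. A hedged sketch; real filters would also normalise digit scripts, dates, number formats and so on:

```python
# Sketch: flag pairs whose numbers disagree (possible conversion
# or misalignment rather than a literal 1:1 translation).
import re

NUM_RE = re.compile(r'\d+(?:[.,]\d+)*')

def numbers_match(src: str, tgt: str) -> bool:
    """True if both sides contain the same multiset of numbers,
    with decimal comma/point variants normalised."""
    norm = lambda s: sorted(m.replace(',', '.') for m in NUM_RE.findall(s))
    return norm(src) == norm(tgt)

print(numbers_match("The town is 5 miles away.",
                    "Der Ort ist 5 Meilen entfernt."))     # True: numbers agree
print(numbers_match("The town is 5 miles away.",
                    "Der Ort ist 8 Kilometer entfernt."))  # False: 5 mi -> 8 km
```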
