> How well is it working for low-resource langs?

We try to support all language pairs.  I've tried Inuktitut-English and
Hindi-Marathi, for example.

The main factors are:

1. How dirty your parallel corpus is
In that sense, low-resource languages are often easier: the relative
ranking just needs to work.
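A minimal sketch of that ranking idea, assuming some sentence-pair quality score is available (the `filter_by_rank` helper and the scores here are hypothetical — only the relative ordering matters, not the absolute values):

```python
# Hypothetical sketch: filter a parallel corpus by relative ranking.
# The scores stand in for any pair-level quality or risk estimate.

def filter_by_rank(pairs, scores, keep_fraction=0.8):
    """Keep the top keep_fraction of sentence pairs by score."""
    ranked = sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)
    cutoff = int(len(ranked) * keep_fraction)
    return [pair for pair, _ in ranked[:cutoff]]

pairs = [("hello", "bonjour"), ("asdf", "qwerty"), ("thanks", "merci")]
scores = [0.9, 0.1, 0.8]
print(filter_by_rank(pairs, scores, keep_fraction=2/3))
# → [('hello', 'bonjour'), ('thanks', 'merci')]
```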

2. How much data we have for the language
My own language (Alemannic) is *not* working well.  It's not in Mozilla
TMs, BERT or LASER, and has no standard orthography.  But a language like
Armenian, with fewer speakers and lower GDP, is working better, because
its Wikipedia is strong and its unique script makes it easy to identify.
At this conference, I expect Oriya/Odia and Khmer will be the toughest.
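To illustrate the script point: a language with its own Unicode block, like Armenian, can be spotted with a trivial character-range check, no model needed. A sketch (the `looks_armenian` helper and threshold are made up for illustration; the range is the Armenian block, U+0530–U+058F):

```python
# Sketch: identify Armenian text by counting characters in the
# Armenian Unicode block (U+0530-U+058F).

def looks_armenian(text, threshold=0.5):
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    armenian = sum(1 for c in letters if '\u0530' <= c <= '\u058f')
    return armenian / len(letters) >= threshold

print(looks_armenian("Բարեւ ձեզ"))  # Armenian "hello" → True
print(looks_armenian("hello"))      # → False
```

Languages that share a script with a bigger neighbor (as Alemannic does with German) don't get this shortcut.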

3. How much data we have for the *pair*
We have seen Hindi-Marathi and Russian-Armenian working decently, but they
are well-established pairs with a lot of cultural overlap (Sprachbund).

4. Your use case
Training a generic system from scratch on very large datasets is
different from fine-tuning for a domain on small data.  (For the former,
you usually want strict 1:1ness, e.g. miles should not convert to
kilometres.)  It won't work well out of the box if you're doing adversarial
attacks or need it calibrated across language pairs.
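One cheap proxy for that strict 1:1ness is that numbers should survive translation unchanged — a pair where "5 miles" became "8 kilometres" is a fluent translation but a risky training example for a generic system. A sketch, with a hypothetical `numbers_match` check:

```python
import re

# Sketch: flag sentence pairs whose numbers don't match exactly,
# e.g. where units were converted rather than translated literally.

def numbers_match(src, tgt):
    nums = lambda s: sorted(re.findall(r'\d+(?:\.\d+)?', s))
    return nums(src) == nums(tgt)

print(numbers_match("The town is 5 miles away.", "La ville est à 5 miles."))  # → True
print(numbers_match("The town is 5 miles away.", "La ville est à 8 km."))     # → False
```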

5. Whether the low-resource language is the source or the target language
Just imagine a human doing this who knows only one of the languages.

There is an unknown language option (*other UND*) so you can even try it on
languages not in the dropdown.  That works better if it's the source
language, not the target language.

If you see issues or have data that can improve a language pair, let me
know.
_______________________________________________
Mt-list mailing list
[email protected]
http://lists.eamt.org/mailman/listinfo/mt-list
