Thanks for the tip!

On Fri, Jan 22, 2021 at 12:38 PM Adam Bittlingmayer <a...@modelfront.com>
wrote:

> Convenience.
>
> Bicleaner and Zipporah are great tools that take a bit more technical work
> to customize and use.
>
> LASER is amazing, but it's really more for cross-language tasks like
> toxicity classification; it was never intended specifically for this task.
> So if the Chinese translation is just English (or has untranslated words),
> or is Japanese, pre-trained LASER won't catch it, because the "distance"
> between the two sentences is indeed low.  The same goes for issues like
> negation or mismatched numbers.
>
> Paracrawl, for example, has been cleaned with Bicleaner, and WikiMatrix
> with LASER.  But when you run them through ModelFront, you still find
> plenty of dirty, dirty sentence pairs.
>
>
>
> On Wed, 20 Jan 2021 at 15:55, Nerses Nersesyan <nersesyanner...@gmail.com>
> wrote:
>
>> How's it different than Bicleaner or LASER?
>>
>> On Tue, Jan 19, 2021 at 4:09 PM <mt-list-requ...@eamt.org> wrote:
>>
>>> Today's Topics:
>>>
>>>    1. Re: CFP: WAT2021 (The 8th Workshop on Asian Translation)
>>>       (Adam Bittlingmayer)
>>>
>>>
>>> ----------------------------------------------------------------------
>>>
>>> Message: 1
>>> Date: Tue, 19 Jan 2021 12:00:26 +0400
>>> From: Adam Bittlingmayer <a...@modelfront.com>
>>> To: Toshiaki Nakazawa <nakaz...@logos.t.u-tokyo.ac.jp>
>>> Cc: mt-list@eamt.org
>>> Subject: Re: [Mt-list] CFP: WAT2021 (The 8th Workshop on Asian
>>>         Translation)
>>> Message-ID: <calson-dwazwpk-v+znqmee4qqjyyzpmukf63h+afcw5dtyb...@mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> > How well is it working for low-resource langs?
>>>
>>> We try to support all language pairs.  I've tried Inuktitut-English and
>>> Hindi-Marathi, for example.
>>>
>>> The main factors are:
>>>
>>> 1. How dirty your parallel corpus is
>>> In that sense, low-resource languages are often easier.  The relative
>>> ranking just needs to be working.
>>>
>>> 2. How much data we have for the language
>>> My own language (Alemannic) is *not* working well.  It's not in Mozilla
>>> TMs, BERT or LASER, and has no standard orthography.  But a language like
>>> Armenian, with a smaller number of speakers and lower GDP, is working
>>> better, because their Wikipedia is top-notch, and their unique script
>>> makes it easy to identify.  In this conference, I expect Oriya/Odia and
>>> Khmer will be the toughest.
>>>
>>> 3. How much data we have for the *pair*
>>> We have seen Hindi-Marathi and Russian-Armenian working decently, but
>>> they are well-established pairs with a lot of cultural overlap
>>> (Sprachbund).
>>>
>>> 4. Your use case
>>> Training from scratch for a generic system on very large datasets is
>>> different from fine-tuning for a domain on small data.  (For the former,
>>> you usually want strict 1:1ness, e.g. miles should not convert to
>>> kilometres.)  It won't work well out of the box if you're doing
>>> adversarial attacks or need it calibrated across language pairs.
>>>
>>> 5. Whether the low-resource language is the source or the target language
>>> Just imagine a human doing this, who only knows one of the languages.
>>>
>>> There is an unknown language option (*other UND*) so you can even try
>>> it on languages not in the dropdown.  That works better if it's the
>>> source language, not the target language.
>>>
>>> If you see issues or have data that can improve a language pair, let me
>>> know.
>>>
>>> ------------------------------
>>>
>>> Subject: Digest Footer
>>>
>>> _______________________________________________
>>> Mt-list mailing list
>>> Mt-list@eamt.org
>>> http://lists.eamt.org/mailman/listinfo/mt-list
>>>
>>>
>>> ------------------------------
>>>
>>> End of Mt-list Digest, Vol 88, Issue 16
>>> ***************************************
>>>
>>
>>
>> --
>> Best regards,
>> Nerses Nersesyan
>>
>>
>>
>>

-- 
Best regards,
Nerses Nersesyan