Thanks for the tip!

On Fri, Jan 22, 2021 at 12:38 PM Adam Bittlingmayer <a...@modelfront.com> wrote:
> Convenience.
>
> Bicleaner and Zipporah are great tools that take a bit more technical work
> to customize and use.
>
> LASER is amazing, but it's really more for cross-language tasks like
> toxicity classification; it was never intended specifically for this task.
> So if the Chinese translation is just English (or has untranslated words),
> or Japanese, pre-trained LASER won't catch it, because the "distance"
> between the two sentences is indeed low. Same with issues like negation,
> or mismatched numbers.
>
> Paracrawl, for example, has been cleaned with Bicleaner, and WikiMatrix
> with LASER. But when you run them through ModelFront, you still find
> plenty of dirty, dirty sentence pairs.
>
> On Wed, 20 Jan 2021 at 15:55, Nerses Nersesyan <nersesyanner...@gmail.com>
> wrote:
>
>> How's it different than Bicleaner or LASER?
>>
>> On Tue, Jan 19, 2021 at 4:09 PM <mt-list-requ...@eamt.org> wrote:
>>
>>> Message: 1
>>> Date: Tue, 19 Jan 2021 12:00:26 +0400
>>> From: Adam Bittlingmayer <a...@modelfront.com>
>>> To: Toshiaki Nakazawa <nakaz...@logos.t.u-tokyo.ac.jp>
>>> Cc: mt-list@eamt.org
>>> Subject: Re: [Mt-list] CFP: WAT2021 (The 8th Workshop on Asian
>>> Translation)
>>> Message-ID:
>>> <calson-dwazwpk-v+znqmee4qqjyyzpmukf63h+afcw5dtyb...@mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> > How well is it working for low-resource langs?
>>>
>>> We try to support all language pairs. I've tried Inuktitut-English and
>>> Hindi-Marathi, for example.
>>>
>>> The main factors are:
>>>
>>> 1. How dirty your parallel corpus is
>>> In that sense, low-resource languages are often easier. The relative
>>> ranking just needs to be working.
>>>
>>> 2. How much data we have for the language
>>> My own language (Alemannic) is *not* working well. It's not in Mozilla
>>> TMs, BERT or LASER, and has no standard orthography. But a language like
>>> Armenian, with a smaller number of speakers and lower GDP, is working
>>> better, because their Wikipedia is top, and their unique script makes it
>>> easy to identify. In this conference, I expect Oriya/Odia and Khmer will
>>> be the toughest.
>>>
>>> 3. How much data we have for the *pair*
>>> We have seen Hindi-Marathi and Russian-Armenian working decently, but they
>>> are well-established pairs with a lot of cultural overlap (Sprachbund).
>>>
>>> 4. Your use case
>>> Training from scratch for a generic system on very large datasets is
>>> different than fine-tuning for a domain on small data. (For the former,
>>> you usually want strict 1:1ness, e.g. miles should not convert to
>>> kilometres.) It won't work well out of the box if you're doing adversarial
>>> attacks or need it calibrated across language pairs.
>>>
>>> 5. If the low-resource language is the source or the target language
>>> Just imagine a human doing this, who only knows one of the languages.
>>>
>>> There is an unknown language option (*other UND*) so you can even try it on
>>> languages not in the dropdown. That works better if it's the source
>>> language, not the target language.
>>>
>>> If you see issues or have data that can improve a language pair, let me
>>> know.
>>
>> --
>> Best regards,
>> Nerses Nersesyan

--
Best regards,
Nerses Nersesyan
_______________________________________________
Mt-list site list
Mt-list@eamt.org
http://lists.eamt.org/mailman/listinfo/mt-list