Re: [Moses-support] CfP: Shared Task: Parallel Corpus Filtering for Low-Resource Conditions

Ergun Bicici Sat, 16 Feb 2019 06:42:46 -0800

The audience for this question might be search engines rather than the
WMT'19 task. Still, it is relevant if the UN Corpus is ranked lower in
internet search results for parallel corpora so as to deliberately hide the
corpora that searchers are looking for, and there may be places and search
engines where internet access is diverted. In those cases, internet
tunneling might work. Some legal clarification could also be added to state
that WMT'19 is not funded by or working for search engines. Yandex did
provide some data before and was thanked for it in WMT'18, but the
acknowledgments for this year do not appear to be finished yet. The license
for the provided data also matters, as it can protect against misuse or
make the data useless for some audiences.


Ergun

On Fri, Feb 15, 2019 at 11:40 PM Philipp Koehn <p...@jhu.edu> wrote:

> Hi,
>
> the identity of the languages that we are tackling here does not matter so
> much as solving this problem for low-resource languages. We did not prepare
> this for Arabic, so this would not be possible at the moment.
>
> By the way, there is a massive parallel corpus for Arabic here:
> https://cms.unov.org/UNCorpus/
>
> -phi
>
> On Fri, Feb 15, 2019 at 1:43 PM Marwa Refaie <basmal...@hotmail.com>
> wrote:
>
>> Dear All
>>
>> I can’t find enough resources even for English-Arabic. Can this call
>> include Arabic along with the mentioned languages?
>>
>> Thanks in Advance
>>
>> Marwa N. Refaie
>>
>> On Feb 15, 2019, at 18:44, Paco Guzman <fguz...@fb.com> wrote:
>>
>> [Apologies for cross-posting]
>>
>> CALL FOR PARTICIPATION
>> *Shared Task: Parallel Corpus Filtering for Low-Resource Conditions*
>> at the Fourth Conference on Machine Translation (WMT19)
>> http://statmt.org/wmt19/parallel-corpus-filtering.html
>>
>> This new shared task tackles the problem of cleaning noisy parallel
>> corpora. Following the WMT18 shared task on parallel corpus filtering
>> <http://www.statmt.org/wmt18/parallel-corpus-filtering.html>, we now
>> pose the problem under more challenging low-resource conditions. Instead of
>> German-English, this year there are two low-resource language pairs:
>> Nepali-English and Sinhala-English.
>> Otherwise, the shared task follows the same set-up: given a noisy
>> parallel corpus (crawled from the web), participants develop methods to
>> filter it to a smaller size of high quality sentence pairs.
>>
>> *DETAILS*
>> We provide a very noisy 35.5 million-word (English token count)
>> Nepali-English corpus and a 59.6 million-word Sinhala-English corpus
>> crawled from the web as part of the Paracrawl <http://paracrawl.eu/>
>> project. We ask participants to provide scores for each sentence in each of
>> the noisy parallel sets. The scores will be used to subsample sentence
>> pairs that amount to 5 million English words. The quality of the resulting
>> subsets is determined by the quality of a statistical machine translation
>> (Moses, phrase-based) and neural machine translation system (FAIRseq)
>> trained on this data. The quality of the machine translation system is
>> measured by BLEU score (sacrebleu) on a held-out test set of Wikipedia
>> translations <https://github.com/facebookresearch/flores> for
>> Sinhala-English and Nepali-English.
>>
>> We also provide links to training data for the two language pairs. This
>> existing data comes from a variety of sources and is of mixed quality and
>> relevance. We provide a script to fetch and compose the training data.
>>
>> Note that the task addresses the challenge of *data quality* and *not
>> domain-relatedness* of the data for a particular use case. While we
>> provide a development and development test set that are also drawn from
>> Wikipedia articles, these may be very different from the final official
>> test set in terms of topics.
>> The provided raw parallel corpora are the outcome of a processing
>> pipeline that aimed for high recall at the cost of precision, so they are
>> very noisy. They exhibit noise of all kinds (wrong language in source and
>> target, sentence pairs that are not translations of each other, bad
>> language, incomplete or bad translations, etc.).
>>
>>
>> *IMPORTANT DATES*
>> Release of raw parallel data: February 8, 2019
>> Submission deadline for subsampled sets: May 10, 2019
>> System descriptions due: May 17, 2019
>> Announcement of results: June 3, 2019
>> Paper notification: June 7, 2019
>> Camera-ready for system descriptions: June 17, 2019
>>
>>
>> *ORGANIZERS*
>> Philipp Koehn (Johns Hopkins University / University of Edinburgh)
>> Francisco (Paco) Guzmán (Facebook)
>> Vishrav Chaudhary (Facebook)
>> Juan Pino (Facebook)
>>
>> More information is available at
>> http://statmt.org/wmt19/parallel-corpus-filtering.html
>>
>> As with other WMT tasks, intending participants are encouraged to
>> register at https://groups.google.com/forum/#!forum/wmt-tasks for
>> discussions and announcements.
>>
>>
>>
>>
>>
>> -- Francisco (Paco) Guzman
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>
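
For anyone new to the task, the subsampling step described under DETAILS in the quoted call (scores per sentence pair, then a subset amounting to 5 million English words) can be sketched roughly as below. This is only an illustration under my own assumptions, not the organizers' script: the function name `subsample_by_score`, whitespace token counting, the (source, English) tuple layout, and stopping at the first pair that would exceed the budget are all hypothetical choices.

```python
from typing import List, Sequence, Tuple


def subsample_by_score(
    pairs: Sequence[Tuple[str, str]],
    scores: Sequence[float],
    budget: int,
) -> Tuple[List[int], int]:
    """Pick the highest-scoring pairs until the English word budget is used.

    pairs  : (source sentence, English sentence) tuples (assumed layout)
    scores : one quality score per pair, higher meaning better
    budget : maximum number of English tokens to keep

    Returns the kept indices (best first) and the English token total.
    """
    if len(pairs) != len(scores):
        raise ValueError("need exactly one score per sentence pair")

    # Rank pair indices from highest to lowest score.
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)

    kept: List[int] = []
    total = 0
    for i in order:
        # Assumption: English tokens are counted by whitespace splitting.
        n_tokens = len(pairs[i][1].split())
        if total + n_tokens > budget:
            # Assumption: stop at the first pair that would exceed the budget.
            break
        kept.append(i)
        total += n_tokens
    return kept, total


if __name__ == "__main__":
    toy_pairs = [
        ("src a", "one two three"),
        ("src b", "four five"),
        ("src c", "six"),
    ]
    toy_scores = [0.2, 0.9, 0.5]
    print(subsample_by_score(toy_pairs, toy_scores, budget=3))
```

With the real data the budget would be 5,000,000 and the scores would come from whatever filtering method a participant submits. The official tooling may count tokens or handle the budget boundary differently.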


-- 

Regards,
Ergun

