[Moses-support] CfP: Shared Task: Parallel Corpus Filtering for Low-Resource Conditions

Paco Guzman Fri, 15 Feb 2019 08:45:27 -0800

[Apologies for cross-posting]

CALL FOR PARTICIPATION
Shared Task: Parallel Corpus Filtering for Low-Resource Conditions
at the Fourth Conference on Machine Translation (WMT19)
http://statmt.org/wmt19/parallel-corpus-filtering.html


This new shared task tackles the problem of cleaning noisy parallel corpora. 
Following the WMT18 shared task on parallel corpus filtering 
<http://www.statmt.org/wmt18/parallel-corpus-filtering.html>, we now 
pose the problem under more challenging low-resource conditions. Instead of 
German-English, this year there are two low-resource language pairs: 
Nepali-English and Sinhala-English.
Otherwise, the shared task follows the same set-up: given a noisy parallel 
corpus (crawled from the web), participants develop methods to filter it down 
to a smaller set of high-quality sentence pairs.

DETAILS
We provide a very noisy 35.5 million-word (English token count) Nepali-English 
corpus and a 59.6 million-word Sinhala-English corpus crawled from the web as 
part of the Paracrawl <http://paracrawl.eu/> project. We ask participants to 
provide a score for each sentence pair in each of the noisy parallel sets. The 
scores will be used to subsample sentence pairs that amount to 5 million 
English words. The quality of the resulting subsets is determined by the 
quality of a statistical machine translation system (Moses, phrase-based) and 
a neural machine translation system (fairseq) trained on this data. The quality 
of each machine translation system is measured by BLEU score (sacrebleu) on a 
held-out test set of Wikipedia translations 
<https://github.com/facebookresearch/flores> for Sinhala-English and 
Nepali-English.
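
To make the subsampling step concrete, here is a minimal sketch of how 
per-pair scores could be turned into a subset under an English word budget. It 
is not the official selection script: the function name, the whitespace 
tokenization, and the rule of stopping once the budget is reached are all 
assumptions made for illustration.

```python
def subsample_by_score(pairs, scores, max_english_words=5_000_000):
    """Pick the highest-scoring sentence pairs up to an English word budget.

    pairs  : list of (source_sentence, english_sentence) tuples
    scores : list of floats, one per pair (higher means better)

    Assumption: English words are counted by whitespace splitting, and
    selection stops as soon as the running total reaches the budget, so
    the last pair added may overshoot it slightly.
    """
    if len(pairs) != len(scores):
        raise ValueError("need exactly one score per sentence pair")

    # Indices of all pairs, best score first.
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)

    selected, total_words = [], 0
    for i in order:
        if total_words >= max_english_words:
            break
        selected.append(i)
        total_words += len(pairs[i][1].split())
    return selected, total_words
```

Called with a small budget, for example 
subsample_by_score(pairs, scores, max_english_words=4), the function returns 
the indices of the selected pairs together with the number of English words 
they contain.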

We also provide links to training data for the two language pairs. This 
existing data comes from a variety of sources and is of mixed quality and 
relevance. We provide a script to fetch and compose the training data.

Note that the task addresses the challenge of data quality and not 
domain-relatedness of the data for a particular use case. While we provide a 
development set and a development test set that are also drawn from Wikipedia 
articles, these may be very different from the final official test set in terms 
of topics.

The provided raw parallel corpora are the outcome of a processing pipeline that 
aimed for high recall at the cost of precision, so they are very noisy. They 
exhibit noise of all kinds (wrong language in source and target, sentence pairs 
that are not translations of each other, bad language, incomplete or bad 
translations, etc.).
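
As an illustration of how one of these noise types (wrong language) might be 
detected, the sketch below scores a pair by how much of each side is written 
in the expected script. It is a toy heuristic and not a method prescribed by 
the task: the function names are hypothetical, and the only external facts it 
relies on are the Unicode block ranges for Devanagari (used for Nepali) and 
Sinhala.

```python
# Unicode block ranges (inclusive) for the scripts of interest.
DEVANAGARI = (0x0900, 0x097F)   # script used for Nepali
SINHALA = (0x0D80, 0x0DFF)
BASIC_LATIN = (0x0041, 0x007A)  # A-Z and a-z once non-letters are dropped


def script_fraction(text, block):
    """Fraction of alphabetic characters in `text` that fall inside `block`."""
    start, end = block
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(start <= ord(ch) <= end for ch in letters) / len(letters)


def toy_language_score(source, english, source_block):
    """Toy score in [0, 1]: high only if both sides use the expected script.

    This catches wrong-script pairs only; it says nothing about whether
    the two sentences are translations of each other.
    """
    return script_fraction(source, source_block) * script_fraction(english, BASIC_LATIN)
```

A pair whose source side is actually English, for example 
toy_language_score("hello", "hello", DEVANAGARI), receives a score of zero, 
while a pair with a Devanagari source and a Latin-script target scores one.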


IMPORTANT DATES
Release of raw parallel data: February 8, 2019
Submission deadline for subsampled sets: May 10, 2019
System descriptions due: May 17, 2019
Announcement of results: June 3, 2019
Paper notification: June 7, 2019
Camera-ready for system descriptions: June 17, 2019


ORGANIZERS
Philipp Koehn (Johns Hopkins University / University of Edinburgh)
Francisco (Paco) Guzmán (Facebook)
Vishrav Chaudhary (Facebook)
Juan Pino (Facebook)

More information is available at 
http://statmt.org/wmt19/parallel-corpus-filtering.html

As with other WMT tasks, prospective participants are encouraged to join 
https://groups.google.com/forum/#!forum/wmt-tasks for discussions and 
announcements.


-- Francisco (Paco) Guzman

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
