SECOND CALL FOR PARTICIPATION

*Shared Task: Parallel Corpus Filtering for Low-Resource Conditions*
at the Fourth Conference on Machine Translation (WMT19)
http://statmt.org/wmt19/parallel-corpus-filtering.html

This new shared task tackles the problem of cleaning noisy parallel corpora. Following the WMT18 shared task on parallel corpus filtering <http://www.statmt.org/wmt18/parallel-corpus-filtering.html>, we now pose the problem under more challenging low-resource conditions. Instead of German-English, this year there are two low-resource language pairs: Nepali-English and Sinhala-English. Otherwise, the shared task follows the same set-up: given a noisy parallel corpus (crawled from the web), participants develop methods to filter it down to a smaller set of high-quality sentence pairs.

*DETAILS*

We provide a very noisy 35.5 million-word (English token count) Nepali-English corpus and a 59.6 million-word Sinhala-English corpus, both crawled from the web as part of the Paracrawl <http://paracrawl.eu/> project. We ask participants to provide a score for each sentence pair in each of the noisy parallel sets. The scores will be used to subsample sentence pairs that amount to 5 million English words.

The quality of the resulting subsets is determined by the quality of a statistical machine translation system (Moses, phrase-based) and a neural machine translation system (FAIRseq) trained on this data. The quality of each machine translation system is measured by BLEU score (sacrebleu) on a held-out test set of Wikipedia translations <https://github.com/facebookresearch/flores> for Sinhala-English and Nepali-English.

We also provide links to training data for the two language pairs. This existing data comes from a variety of sources and is of mixed quality and relevance. We provide a script to fetch and compose the training data.

Note that the task addresses the challenge of *data quality* and *not domain-relatedness* of the data for a particular use case. While we provide a development set and a development test set that are also drawn from Wikipedia articles, these may be very different from the final official test set in terms of topics.

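The subsampling step described above (scores select sentence pairs up to a budget of 5 million English words) can be sketched roughly as follows. This is only an illustration, not the official subsampling script: the function name `subsample_by_score`, whitespace tokenization for the English word count, and stopping at the first pair that would exceed the budget are all assumptions.

```python
def subsample_by_score(pairs, scores, max_english_words=5_000_000):
    """Pick the highest-scoring sentence pairs within an English word budget.

    pairs  -- list of (source_sentence, english_sentence) tuples
    scores -- list of floats, one per pair (higher means better quality)

    Returns the selected indices (in corpus order) and the English word total.
    Hypothetical helper; the official tooling may behave differently.
    """
    if len(pairs) != len(scores):
        raise ValueError("need exactly one score per sentence pair")

    # Rank pair indices from highest to lowest score.
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)

    selected = []
    total_words = 0
    for i in ranked:
        # Assumption: English words are counted by whitespace splitting.
        n_words = len(pairs[i][1].split())
        if total_words + n_words > max_english_words:
            break  # assumption: stop at the first pair that would overshoot
        selected.append(i)
        total_words += n_words

    return sorted(selected), total_words
```

With real data, `pairs` would be read from the noisy corpus and `scores` from a participant's score file, one score per line in the same order.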
The provided raw parallel corpora are the outcome of a processing pipeline that aimed for high recall at the cost of precision, so they are very noisy. They exhibit noise of all kinds (wrong language in source and target, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.).

*IMPORTANT DATES*

Release of raw parallel data: February 8, 2019
Submission deadline for subsampled sets: May 10, 2019
System descriptions due: May 17, 2019
Announcement of results: June 3, 2019
Paper notification: June 7, 2019
Camera-ready for system descriptions: June 17, 2019

*ORGANIZERS*

Philipp Koehn (Johns Hopkins University / University of Edinburgh)
Francisco (Paco) Guzmán (Facebook)
Vishrav Chaudhary (Facebook)
Juan Pino (Facebook)

More information is available at http://statmt.org/wmt19/parallel-corpus-filtering.html

As with other WMT tasks, intending participants are encouraged to join https://groups.google.com/forum/#!forum/wmt-tasks for discussions and announcements.