MT4All Unsupervised MT Shared Task

at SIGUL 2022

(24-25 June, Marseille)


SECOND CALL FOR PARTICIPATION


We invite you to participate in the first edition of the MT4All Unsupervised Machine Translation Shared Task, hosted by the ELRA/ISCA Special Interest Group on Under-Resourced Languages Workshop (SIGUL 2022). Papers on the task will be published as part of the Proceedings.


Invitation to Participate – Expression of Interest <https://docs.google.com/forms/d/1tllq0jWhcKwMHgPtRCA4aLkgLDuN8JlZG7Vp4TqcNQ0>.


TASK DESCRIPTION

For this Shared task, we will leverage the resources generated by the recently finished CEF project MT4All, with the aim of exploring unsupervised MT techniques based only on monolingual corpora. In the course of the project, the following novel datasets were created: 18 monolingual corpora for specific languages and domains, 12 bilingual dictionaries and translation models, and 10 annotated datasets for evaluation. Most of them will be used in the present Shared task.

The task is divided into three separate subtasks, each one covering a specific domain and set of languages.

 * Subtask 1: Unsupervised translation from English to Ukrainian,
   Georgian and Kazakh in the Legal domain.

 * Subtask 2: Unsupervised translation from English to Finnish,
   Latvian, and Norwegian Bokmål in the Financial domain.

 * Subtask 3: Unsupervised translation from English to German,
   Norwegian Bokmål, and Spanish in the Customer support domain.

In this Shared task, we are interested in how the in-domain monolingual data that we will provide can be leveraged to build a purely unsupervised machine translation model, either by

 * training an unsupervised model from scratch, or

 * adding value to an existing pre-trained model, on the condition that

     o it has been trained on monolingual datasets

     o it has not been fine-tuned with any parallel data

     o it is publicly accessible from the HuggingFace repository
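
For participants who want to check a candidate pre-trained model against the three conditions above, the following is a minimal sketch of the checklist in Python. It is an illustration only, not an official validation tool: the class name, field names, and the example model names are hypothetical, and the values must be filled in by hand from the model's own documentation.

```python
from dataclasses import dataclass


@dataclass
class PretrainedModel:
    # Hypothetical descriptor; every field is supplied by the participant.
    name: str
    trained_on_monolingual_data: bool
    fine_tuned_on_parallel_data: bool
    on_huggingface: bool


def is_eligible(model: PretrainedModel) -> bool:
    """Return True only if all three conditions from the call hold."""
    return (
        model.trained_on_monolingual_data
        and not model.fine_tuned_on_parallel_data
        and model.on_huggingface
    )


# Hypothetical examples, not real models.
candidate_ok = PretrainedModel("example/mono-lm", True, False, True)
candidate_bad = PretrainedModel("example/parallel-tuned", True, True, True)

print(is_eligible(candidate_ok))
print(is_eligible(candidate_bad))
```

A model that fails any one of the conditions is treated as not eligible in this sketch; in case of doubt, the organisers should be contacted.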

Although we exclude the possibility of fine-tuning the models with any existing parallel data, we allow making use of the bilingual resources created in the framework of MT4All using purely unsupervised technologies.

As additional monolingual data, only the monolingual OSCAR datasets may be used.

IMPORTANT DATES

 * Training data release: 10.03.2022

 * Test sets release: 25.04.2022

 * Results deadline: 02.05.2022

 * Paper submission deadline: 16.05.2022

 * Acceptance notice: 30.05.2022

 * Camera ready: 13.06.2022

 * Workshop starts: 24.06.2022

Please visit the website for more details: https://sigul-2022.ilc.cnr.it/mt4all-shared-task/

If you have any comments or questions, do not hesitate to contact ksenia.kharitonova at bsc.es.