Apologies for cross-posting

We cordially invite all researchers and practitioners to participate in the 
Non-Repetitive Translation task in WMT 2024.



This task focuses on lexical choice in machine translation. If interested, see 
the details at the following link:  
https://www2.statmt.org/wmt24/non-repetitive-translation-task.html



*Important Dates*

Submission deadline for the task: July 21st
Paper submission deadline: August 20th
Notification of acceptance: September 20th
Camera-ready deadline: October 3rd
Conference: November 15-16

*Task Description*

This task focuses on lexical choice in machine translation, especially the 
choice of words for repeated expressions in a source sentence. In English, 
repeating the same word can create a monotonous or awkward impression, and such 
repetition should generally be avoided. Typical workarounds in monolingual 
writing are to (1) remove redundant terms where possible (reduction) or (2) use 
alternative words such as synonyms as substitutes (substitution). These 
techniques are also observed in human translations.
The goal of this task is to study how these techniques can be incorporated into 
machine translation systems to enrich their lexical choice capabilities. From a 
practical standpoint, such a capability is important, for example, in news 
production, where high-quality text that goes beyond robotic word-by-word 
translation is required. Specifically, participants are required to control a 
machine translation system using reduction or substitution so that it does not 
output the same words for certain repeated words in a source sentence. The 
translation direction is Japanese to English.
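
To make the control target concrete, here is a minimal Python sketch (not part 
of the official task tooling) of detecting the repeated content words in a 
source sentence that such a system would need to handle. The pre-tokenized 
input and the stopword list are illustrative assumptions.

    # Minimal sketch: find the content words that repeat in a source
    # sentence. The pre-tokenized input and stopword list are
    # illustrative assumptions, not part of the task definition.
    from collections import Counter

    def repeated_content_words(tokens, stopwords=frozenset()):
        """Return the content words that occur more than once."""
        counts = Counter(t for t in tokens if t not in stopwords)
        return {word for word, n in counts.items() if n > 1}

    # Toy example (romanized): "kaigi" (meeting) appears twice, so a
    # non-repetitive system should avoid emitting "meeting" twice.
    source = ["kaigi", "wa", "ashita", "no", "kaigi", "ni", "hikitsugareru"]
    print(repeated_content_words(source, stopwords={"wa", "no", "ni"}))
    # -> {'kaigi'}

A real system would of course start from raw Japanese text with a proper 
morphological analyzer, and, as noted under *Data Set* below, not every 
repeated word is an evaluation target.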



*Challenges*



The challenges underlying this task include the following:
- Balancing translation quality against output control: Translation quality 
can degrade when the non-repetitive style is enforced inappropriately.
- Avoiding bias toward high-frequency bilingual word pairs: In general, for a 
given source word, the high-frequency target words associated with it are more 
likely to be output. This can make it difficult to determine appropriate 
substitutions for some words.
- Predicting which words can be reduced or substituted: Predicting which source 
words can appropriately be reduced or substituted is not an easy problem 
because it depends on the context within the sentence.
- Mining training instances: Translations with reduction can be especially 
difficult to identify in noisy corpora because of the challenge of 
discriminating them from undertranslations (a rough heuristic sketch follows 
this list).
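
As an illustration of the mining challenge in the last bullet, the sketch 
below classifies how a repeated source word was handled in a word-aligned 
sentence pair. The alignment format, a list of (src_idx, tgt_idx) pairs, is an 
assumption made for illustration; the reduction case is flagged only as a 
candidate precisely because it is hard to separate from undertranslation.

    # Heuristic sketch for mining candidate training instances from
    # word-aligned parallel data. The (src_idx, tgt_idx) alignment
    # format is an illustrative assumption, not prescribed by the task.
    def classify_repetition(src_tokens, tgt_tokens, alignment, word):
        """Classify how a repeated source word surfaced in the target."""
        src_positions = {i for i, t in enumerate(src_tokens) if t == word}
        if len(src_positions) < 2:
            return "not repeated"
        aligned = [tgt_tokens[j] for i, j in alignment if i in src_positions]
        if len(aligned) < len(src_positions):
            # Fewer aligned target words than source occurrences: maybe
            # reduction, maybe just undertranslation -- the hard case.
            return "candidate reduction"
        if len(set(aligned)) > 1:
            return "candidate substitution"  # different target words used
        return "repeated verbatim"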



*Data Set*



We provide development and test sets for this task. In both data sets, every 
Japanese sentence contains some repeated words that are translated into English 
with reduction or substitution. We collected these data from Jiji 
Japanese-English news articles. Specifically, we first automatically created 
sentence pairs based on lexical similarities and then manually selected 
instances suited to this task. These sentence pairs include not only 
one-to-one pairs but also two-to-two pairs. Both the development and test sets 
contain raw and tagged parallel data. In the tagged data, repeated words in the 
source sentence and their counterparts in the target sentence are marked with 
tags, indicating that these words are evaluation targets. Note that not all 
words repeated in the source sentence are evaluation targets: some words, such 
as proper nouns and technical terms, should be translated consistently even 
when they are repeated within a sentence.
Tagged development data are provided to help tune the model during training. 
However, participants may not use the tagged test data and must use the raw 
test data when submitting their system results. In this task, systems must 
detect, on their own, the repeated words that can be reduced or substituted.
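
For working with the tagged development data, something like the following 
sketch could serve as a starting point. The inline markup assumed here 
(<rep1>...</rep1>) is purely hypothetical; check the actual tag format 
described on the task page and adjust the pattern accordingly.

    import re

    # Hypothetical tag format: evaluation-target words wrapped as
    # <repN>...</repN>. Replace the pattern with the actual markup
    # distributed with the data.
    TAG = re.compile(r"<rep(\d+)>(.*?)</rep\1>")

    def extract_targets(tagged_sentence):
        """Return {tag_id: [marked spans]} for the evaluation targets."""
        targets = {}
        for tag_id, span in TAG.findall(tagged_sentence):
            targets.setdefault(tag_id, []).append(span)
        return targets

    print(extract_targets(
        "The <rep1>meeting</rep1> follows last week's <rep1>session</rep1>."))
    # -> {'1': ['meeting', 'session']}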
To reduce the negative effects of imbalanced content between the source and 
target sentences, the Japanese sentences in the development and test data were 
manually translated from the English sentences while preserving as much of the 
vocabulary of the original Japanese sentences as possible.



As for the training data, we also provide all the data from the WAT 2020 
Newswire tasks, which were likewise constructed from Jiji news articles and 
have been used continuously in WAT since 2020. These data form a regular 
parallel corpus and have not been annotated specifically for this task, but 
they are in exactly the same domain. Although the development and test data 
from the WAT 2020 Newswire tasks are not directly related to the evaluation of 
this task, they can be used to measure basic translation performance during 
training. In addition, participants may also use any other publicly available 
corpora, such as the data from the general MT task at WMT, for training. When 
using external data, be sure to include a description of the data in your 
paper.



*Organizers*
Kazutaka Kinugawa (kinugawa.k...@nhk.or.jp), NHK
Hideya Mino, NHK
Naoto Shirai, NHK
Isao Goto, Ehime University
