[Moses-support] Release 2019 of DGT-Translation Memory (free parallel corpus in 24 languages)

Camelia Ignat Thu, 20 Jun 2019 06:34:23 -0700

Dear all,

We are happy to announce that the 2019 update release of the
DGT-Translation Memory (DGT-TM) is now available for free download. DGT-TM
covers *24 languages and 276 language pairs*.

*This year’s release adds 10 million translation units (~ sentences) – or
171 million words – to the collection*.

With this update, a *total of 131 million translation units is now
available for download, equivalent to over 2.2 billion words*. More data
for language pairs involving *Maltese* are available on request.

DGT-TM is an extraction of the translation memory of the European
Institutions for all 24 official EU languages, produced by the European
Commission’s *Directorate General for Translation* (DGT) and distributed by
the *Joint Research Centre* (JRC). Translation memories are sentences and
their manually produced translations.

The new release is called *DGT-TM-2019*. It follows the original 2007
release DGT-TM and the yearly updates since 2011.

*Languages:* All *276 language pairs* involving the following 24 languages:

Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian,
German, Greek, Finnish, French, Irish, Hungarian, Italian,
Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian,
Slovak, Slovene, Spanish and Swedish.
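
As a quick check of the figure quoted above, 24 languages give 24 × 23 / 2 = 276 unordered language pairs (English-French and French-English count once). A minimal sketch; the variable names are illustrative only:

```python
from itertools import combinations
from math import comb

# The 24 languages listed in the announcement.
LANGUAGES = [
    "Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
    "Estonian", "German", "Greek", "Finnish", "French", "Irish",
    "Hungarian", "Italian", "Latvian", "Lithuanian", "Maltese", "Polish",
    "Portuguese", "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: each combination of two distinct languages appears once.
pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES))           # 24
print(len(pairs))               # 276
print(comb(len(LANGUAGES), 2))  # 276, i.e. 24 * 23 / 2
```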

*URL: *
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

*Creator: *European Commission - Directorate General for Translation (DGT
<http://ec.europa.eu/dgs/translation/index_en.htm>)

*WHAT IS DGT-TM*

The ‘Acquis Communautaire <http://europa.eu/abc/eurojargon/index_en.htm>’
is the entire body of European legislation, comprising all the treaties,
regulations and directives adopted by the European Union (EU). Since each
new country joining the EU is required to accept the whole Acquis
Communautaire, this body of legislation has been translated into 23
official languages. For the 24th official EU language, *Irish*, the Acquis
has not been translated on a regular basis, which is why DGT-TM includes
less data in Irish. The Acquis Communautaire was split into sentences and
aligned automatically at sentence level, resulting in the DGT translation
memory, DGT-TM. Small parts of the alignment data have been corrected by
translators. The text data is accompanied by software that allows
extracting all sentences and their translations for any of the 276 possible
language pair combinations.
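
To illustrate what extracting one language pair from a multilingual translation memory involves, here is a rough sketch. It is not the software distributed with DGT-TM. It assumes the data is in the standard TMX exchange format (translation units `tu`, each holding one `tuv` variant per language with a `seg` text segment); the announcement itself does not state the file format. The function name `extract_pairs`, the language codes and the sample units are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_source, src_lang, tgt_lang):
    """Yield (source, target) segments from a TMX file path or binary file object.

    Hypothetical helper: only units containing BOTH languages are returned.
    """
    for _, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.findall("tuv"):
            # TMX 1.4 uses xml:lang; fall back to a plain "lang" attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang.upper()] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # drop the processed unit's children to limit memory use

# Invented sample data, NOT taken from DGT-TM; language codes are assumptions.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Article premier</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Artikel 1</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Entry into force</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Inkrafttreten</seg></tuv>
    </tu>
  </body>
</tmx>"""

en_fr = list(extract_pairs(io.BytesIO(SAMPLE), "EN-GB", "FR-FR"))
print(en_fr)  # [('Article 1', 'Article premier')]
```

The second sample unit has no French variant, so it is skipped for the English-French pair. This is also why the amount of data differs from one language pair to another.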

*MOTIVATION FOR THIS RELEASE*

The public data release is in line with the general effort of the European
Commission to support multilingualism, language diversity and the re-use of
Commission information. It follows the release of a number of further
multilingual data sets:

· the *JRC-Acquis* parallel corpus in 2006 (over 1 billion words in
22 languages),

· the *DGT-TM* Translation Memory in 2007,

· the multilingual named entity resource *JRC-Names* in 2011 (and
its Linked Data version in 2016),

· the multilingual multi-label classification tool (and
accompanying text data) *JRC EuroVoc Indexer (JEX)* (22 languages) in 2012,

· the *ECDC-TM* Translation Memory in 2012 (domain: Public Health),

· the *DGT-Acquis* parallel corpus in 2012,

· the *EAC-TM* Translation Memory in 2013 (domain: Education and
Culture),

· the *DCEP* (Digital Corpus of the European Parliament) in 2014,

· and further smaller multilingual resources.

See https://ec.europa.eu/jrc/en/language-technologies for more information
on these resources.

*WHAT DGT-TM CAN BE USED FOR*

DGT-TM can be fed into translation memory software to support human
translators in their work. As it is a large parallel corpus in electronic
form, it can furthermore be used by specialists in computational
linguistics to train statistical machine translation software, to generate
multilingual dictionaries, to train and test multilingual information
extraction software, and more.
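
For statistical machine translation training, toolkits such as Moses expect a parallel corpus as two plain-text files with one sentence per line, where line N of one file is the translation of line N of the other. The sketch below shows one way to write extracted sentence pairs in that layout. The helper name `write_parallel`, the file names and the sample pairs are hypothetical, and the whitespace handling is an assumption rather than a prescribed preprocessing step.

```python
import os
import tempfile

def write_parallel(pairs, stem, src_ext, tgt_ext):
    """Write (source, target) pairs to <stem>.<src_ext> and <stem>.<tgt_ext>.

    One sentence per line, line-aligned. Returns the number of pairs written.
    """
    kept = 0
    with open(f"{stem}.{src_ext}", "w", encoding="utf-8") as src_f, \
         open(f"{stem}.{tgt_ext}", "w", encoding="utf-8") as tgt_f:
        for src, tgt in pairs:
            # Collapse internal newlines and repeated spaces so that each
            # segment stays on exactly one line.
            src = " ".join(src.split())
            tgt = " ".join(tgt.split())
            if not src or not tgt:
                continue  # an empty side would break the line alignment
            src_f.write(src + "\n")
            tgt_f.write(tgt + "\n")
            kept += 1
    return kept

# Invented example pairs, not taken from DGT-TM.
example_pairs = [
    ("Article 1", "Article premier"),
    ("Entry\ninto force", "Entrée en vigueur"),
    ("", "Segment sans source"),
]

out_dir = tempfile.mkdtemp()
stem = os.path.join(out_dir, "corpus")
written = write_parallel(example_pairs, stem, "en", "fr")
print(written)  # 2
```

The resulting `corpus.en` and `corpus.fr` files still need the usual tokenisation and cleaning steps before training.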

*MORE INFORMATION ON DGT-TM*

A list of publications on the JRC’s multilingual language technology
activity is available at
https://wt-public.emm4u.eu/Resources/JRC-EMM_Publications.pdf. For
details specifically on DGT-TM, you can read:

Steinberger Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos &
Patrick Schlüter (2012).
*DGT-TM: A freely Available Translation Memory in 22 Languages*.
Proceedings of the 8th International Conference on Language Resources
and Evaluation (LREC'2012), Istanbul, 21-27 May 2012.
http://www.lrec-conf.org/proceedings/lrec2012/pdf/814_Paper.pdf

The following more recent article compares all freely available Language
Technology resources distributed by the JRC and provides comparative
background information:

Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel
Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro
(2014).
*An overview of the European Union's highly multilingual parallel
corpora* <http://link.springer.com/article/10.1007/s10579-014-9277-0>.
Language Resources and Evaluation Journal (LRE).
DOI: 10.1007/s10579-014-9277-0.
(Manuscript:
<http://langtech.jrc.it/Documents/2014_08_LRE-Journal_JRC-Linguistic-Resources_Manuscript.pdf>,
also at
https://ec.europa.eu/jrc/sites/jrcsh/files/2014_08_LRE-Journal_JRC-Linguistic-Resources_Manuscript.pdf
).


Camelia Ignat, PhD

European Commission – Joint Research Centre (JRC)

Directorate I – Competences,

I 03 – Competence Centre on Text Mining and Analysis

T.P. 440, Via E. Fermi 2749

21027 Ispra (VA), Italy

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
