[Moses-support] PMIndia - A Collection of Parallel Corpora of Languages of India

Barry Haddow Wed, 29 Jan 2020 03:50:52 -0800

Hi All

We have released a new collection of sentence-aligned corpora pairing
English with 13 different languages spoken in India. Up to 56k sentence
pairs are available for each language pair. The languages of India
contained in the corpora are Assamese, Bengali, Gujarati, Hindi, Kannada,
Malayalam, Manipuri, Marathi, Odia, Punjabi, Tamil, Telugu and Urdu. We
also provide a larger version of the corpus that is document-aligned only.


The corpus is available here: http://data.statmt.org/pmindia/

An accompanying paper describes the construction of the corpus, compares
alignment methods, and reports some initial MT results:

https://arxiv.org/abs/2001.09907


Barry Haddow and Faheem Kirefu




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
