Hi all,

We have released a new sentence-aligned corpus pairing English with 13 languages spoken in India. Up to 56k sentence pairs are available for each language pair. The Indian languages covered are Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Odia, Punjabi, Tamil, Telugu and Urdu. We also provide a larger version of the corpus that is document-aligned only.

The corpus is available here: http://data.statmt.org/pmindia/

There is an accompanying paper which describes the construction of the corpus, a comparison of alignment methods, and some initial MT results: https://arxiv.org/abs/2001.09907

Barry Haddow and Faheem Kirefu