Re: [Moses-support] PMIndia - A Collection of Parallel Corpora of Languages of India

2020-01-29 Thread Thoudam Doren Singh
Hi Barry

Good job. For some language pairs with fewer than 10k sentence pairs,
the reported BLEU scores are quite appealing.


Best Regards


Doren



On Wednesday, January 29, 2020, Barry Haddow wrote:

> Hi All
>
> We have released a new sentence-aligned corpus pairing English with 13
> different languages spoken in India. Up to 56k sentence pairs are
> available for each pair. The languages of India contained in the corpus
> are Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Manipuri,
> Marathi, Odia, Punjabi, Tamil, Telugu and Urdu. We also provide a larger
> version of the corpus, document-aligned only.
>
> The corpus is available here: http://data.statmt.org/pmindia/
>
> There is an accompanying paper which describes the construction of the
> corpus, a comparison of alignment methods, and some initial MT results.
>
> https://arxiv.org/abs/2001.09907
>
>
> Barry Haddow and Faheem Kirefu
>
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] PMIndia - A Collection of Parallel Corpora of Languages of India

2020-01-29 Thread Barry Haddow
Hi All

We have released a new sentence-aligned corpus pairing English with 13
different languages spoken in India. Up to 56k sentence pairs are
available for each pair. The languages of India contained in the corpus
are Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Manipuri, 
Marathi, Odia, Punjabi, Tamil, Telugu and Urdu. We also provide a larger 
version of the corpus, document-aligned only.

The corpus is available here: http://data.statmt.org/pmindia/

There is an accompanying paper which describes the construction of the 
corpus, a comparison of alignment methods, and some initial MT results.

https://arxiv.org/abs/2001.09907


Barry Haddow and Faheem Kirefu




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
