[Mt-list] Release of Wikipedia-based monolingual, comparable, and parallel corpora

Peter Kolb Thu, 27 Nov 2014 01:59:20 -0800

Dear colleagues,

We have released three types of corpora extracted from 23 language versions
of Wikipedia:


1. Wikipedia Monolingual Corpora: more than 5 billion tokens of text in 23
languages extracted from Wikipedia. The corpora are annotated with
article and paragraph boundaries, the number of incoming links for each
article, the anchor texts used to refer to each article (textlinks) and
their frequencies, cross-language links, categories, and more (
http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/). There
is also a script that extracts domain-specific sub-corpora from a list of
desired categories that you provide.

2. Wikipedia Comparable Corpora: more than 41 million bilingually aligned
Wikipedia articles for 253 language pairs (
http://linguatools.org/tools/corpora/wikipedia-comparable-corpora/).

3. Wikipedia Parallel Titles Corpora: bilingual titles of Wikipedia
articles, extended with redirects and textlinks. 487,406,497 unique
parallel segments for 253 language pairs (
http://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/).
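
As a quick sanity check on the figures above: 253 is exactly the number of
unordered pairs that can be formed from 23 languages, which suggests (though
the announcement does not state it explicitly) that every possible language
pair is covered. A minimal sketch of that arithmetic follows; the language
labels in it are hypothetical placeholders, not the actual language codes
used in the corpora.

```python
from itertools import combinations
from math import comb

# Number of Wikipedia language versions stated in the announcement.
NUM_LANGUAGES = 23

# Unordered pairs: (A, B) and (B, A) are counted once.
pair_count = comb(NUM_LANGUAGES, 2)

# Cross-check by enumerating pairs over placeholder labels.
# These labels are hypothetical; the real corpora use their own language codes.
placeholder_ids = [f"lang{i:02d}" for i in range(NUM_LANGUAGES)]
enumerated_pairs = list(combinations(placeholder_ids, 2))

print(pair_count)             # 23 * 22 / 2
print(len(enumerated_pairs))  # same count, obtained by enumeration
```

Both values come out to 253, matching the number of language pairs quoted
for the comparable and parallel titles corpora.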

Additionally, there is a tiny German-English parallel corpus containing
6,802 sentence pairs extracted from bilingual quotations in the German
Wikipedia:
http://linguatools.org/tools/corpora/wikipedia-parallel-quotations-corpora/.

All corpora are released under a Creative Commons Attribution-ShareAlike
license and are freely available at http://linguatools.org/tools/corpora/.

Best regards,
Peter Kolb

-- 

Peter Kolb & Procházková GbR
Perleberger Str. 55
D-10559 Berlin

E-Mail: peter.k...@linguatools.org
Internet: http://www.linguatools.org

_______________________________________________
Mt-list site list
Mt-list@eamt.org
http://lists.eamt.org/mailman/listinfo/mt-list
