Dear colleagues, we have released three types of corpora extracted from 23 language versions of Wikipedia:

1. Wikipedia Monolingual Corpora: more than 5 billion tokens of text in 23 languages extracted from Wikipedia. The corpora are annotated with article and paragraph boundaries, the number of incoming links for each article, the anchor texts used to refer to each article (textlinks) and their frequencies, cross-language links, categories and more (http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/). There is also a script that extracts domain-specific sub-corpora from a list of categories you provide.

2. Wikipedia Comparable Corpora: more than 41 million bilingually aligned Wikipedia articles for 253 language pairs (http://linguatools.org/tools/corpora/wikipedia-comparable-corpora/).

3. Wikipedia Parallel Titles Corpora: bilingual titles of Wikipedia articles, extended with redirects and textlinks. 487,406,497 unique parallel segments for 253 language pairs (http://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/).

Additionally, there is a tiny German-English parallel corpus containing 6,802 sentence pairs extracted from bilingual quotations in the German Wikipedia: http://linguatools.org/tools/corpora/wikipedia-parallel-quotations-corpora/

All corpora are released under a Creative Commons Attribution-ShareAlike license and are freely available at http://linguatools.org/tools/corpora/

Best regards,
Peter Kolb

--
Peter Kolb & Procházková GbR
Perleberger Str. 55
D-10559 Berlin
E-Mail: peter.k...@linguatools.org
Internet: http://www.linguatools.org