Hi all,

I have recently been working on a utility to automatically extract annotated multilingual corpora for the Named Entity Recognition (NER) task out of Wikipedia dumps.
The tool is named pignlproc. It is licensed under ASL 2.0, available at https://github.com/ogrisel/pignlproc, and uses Apache Hadoop, Apache Pig and Apache Whirr to perform the processing on a cluster of tens of virtual machines on the Amazon EC2 cloud infrastructure (you can of course also run it locally on a single machine).

Here is a sample of the output on the French Wikipedia dump:

  http://pignlproc.s3.amazonaws.com/corpus/fr/opennlp_location/part-r-00000

You can replace "location" with "person" or "organization" in the previous URL for more examples, and replace "part-r-00000" with "part-r-000XX" to download other chunks of the corpus.

And here are some trained models (50 iterations on the first 3 chunks of each corpus, i.e. ~100k annotated sentences per type):

  http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-location.bin
  http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-person.bin
  http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-organization.bin

You can retrain these models on a larger subset of chunks by allocating more than 2GB of heap space to the OpenNLP CLI tool (I used version 1.5.0). The corpus is quite noisy, so the performance of the trained models is not optimal (but better than nothing anyway).
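For readers unfamiliar with the corpus files linked above: each line is a sentence in OpenNLP's name finder training format, where entity spans are delimited by `<START:type>` and `<END>` tokens. As a rough, standalone illustration (not part of pignlproc or OpenNLP itself), here is a minimal parser that pulls the annotated surface forms out of one such line:

```java
import java.util.ArrayList;
import java.util.List;

public class NerFormatDemo {
    /**
     * Extracts the surface forms of annotated entities from one sentence
     * in OpenNLP name finder training format, e.g.
     * "Il habite a <START:location> Paris <END> depuis 2001 ."
     */
    public static List<String> extractEntities(String line) {
        List<String> entities = new ArrayList<>();
        String[] tokens = line.trim().split("\\s+");
        StringBuilder current = null;
        for (String tok : tokens) {
            if (tok.startsWith("<START:")) {
                current = new StringBuilder();          // entity span opens
            } else if (tok.equals("<END>") && current != null) {
                entities.add(current.toString().trim()); // entity span closes
                current = null;
            } else if (current != null) {
                current.append(tok).append(' ');         // token inside a span
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        String line = "Il habite a <START:location> Paris <END> depuis 2001 .";
        System.out.println(extractEntities(line)); // prints [Paris]
    }
}
```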
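The evaluation figures below are the standard precision / recall / F-measure triple, where F is the harmonic mean F = 2PR / (P + R). A quick sanity check in plain Java (no OpenNLP dependency, just the formula):

```java
public class FMeasureCheck {
    /** Harmonic mean of precision and recall (the F1 score). */
    public static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision 0.87 and recall 0.74 give an F-measure of about 0.80.
        System.out.printf("%.2f%n", fMeasure(0.87, 0.74)); // prints 0.80
    }
}
```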
Here are the results of evaluations on held-out chunks of the corpus (+/- 0.02):

  - Location:     Precision: 0.87  Recall: 0.74  F-Measure: 0.80
  - Person:       Precision: 0.80  Recall: 0.68  F-Measure: 0.74
  - Organization: Precision: 0.80  Recall: 0.65  F-Measure: 0.72

If you would like to build new models for new entity types (based on the DBpedia ontology) or for other languages, you can find some documentation on how to fetch the data and set up a Hadoop / EC2 cluster here:

  https://github.com/ogrisel/pignlproc/blob/master/README.md
  https://github.com/ogrisel/pignlproc/wiki

The Pig scripts used to build these models are rather short and simple to understand:

  https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/01_extract_sentences_with_links.pig
  https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig
  https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/03_join_by_type_and_convert.pig

I plan to give more details in a blog post soon(tm). As always, any feedback is warmly welcome.

If you think these Pig utilities would blend into the OpenNLP project, both my employer (Nuxeo) and I would be glad to contribute them to the ASF.

Cheers,

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
