This looks great, and it aligns with my own recent interest in large-scale NLP with Hadoop, including working with Wikipedia. I'll look at it more closely later, but in principle I would be interested in having this brought into the OpenNLP project in some way!
On Tue, Jan 4, 2011 at 12:04 PM, Olivier Grisel <[email protected]> wrote:

> Hi all,
>
> I have lately been working on a utility to automatically extract
> annotated multilingual corpora for the Named Entity Recognition task
> out of Wikipedia dumps.
>
> The tool is named pignlproc, is licensed under ASL2, and is available at
> https://github.com/ogrisel/pignlproc. It uses Apache Hadoop, Apache
> Pig and Apache Whirr to perform the processing on a cluster of tens of
> virtual machines on the Amazon EC2 cloud infrastructure (you can also
> run it locally on a single machine, of course).
>
> Here is a sample of the output on the French Wikipedia dump:
>
> http://pignlproc.s3.amazonaws.com/corpus/fr/opennlp_location/part-r-00000
>
> You can replace "location" by "person" or "organization" in the
> previous URL for more examples. You can also replace "part-r-00000" by
> "part-r-000XX" to download larger chunks of the corpus.
>
> And here are some trained models (50 iterations on the first 3 chunks
> of each corpus, i.e. ~ 100k annotated sentences for each type):
>
> http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-location.bin
> http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-person.bin
> http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-organization.bin
>
> It is possible to retrain those models on a larger subset of chunks by
> allocating more than 2GB of heap space to the OpenNLP CLI tool (I used
> version 1.5.0).
>
> The corpus is quite noisy, so the performance of the trained models is
> not optimal (but better than nothing anyway). Here are the results of
> evaluations on held-out chunks of the corpus (+/- 0.02):
>
> - Location:
>
>   Precision: 0.87
>   Recall: 0.74
>   F-Measure: 0.80
>
> - Person:
>
>   Precision: 0.80
>   Recall: 0.68
>   F-Measure: 0.74
>
> - Organization:
>
>   Precision: 0.80
>   Recall: 0.65
>   F-Measure: 0.72
>
> If you would like to build new models for new entity types (based on
> the DBpedia ontology) or other languages, you can find some
> documentation on how to fetch the data and set up a Hadoop / EC2
> cluster here:
>
> https://github.com/ogrisel/pignlproc/blob/master/README.md
> https://github.com/ogrisel/pignlproc/wiki
>
> The Pig scripts used to build these models are rather short and simple
> to understand:
>
> https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/01_extract_sentences_with_links.pig
> https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig
> https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/03_join_by_type_and_convert.pig
>
> I plan to give more details in a blog post soon (tm).
>
> As always, any feedback is warmly welcome. If you think those Pig
> utilities would blend into the OpenNLP project, both my employer
> (Nuxeo) and I would be glad to contribute them to the project at
> the ASF.
>
> Cheers,
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel

--
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://comp.ling.utexas.edu/people/jason_baldridge
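[A note on the corpus format, for anyone fetching the chunks linked above: since the models were trained with the OpenNLP CLI, the chunks are presumably in the standard OpenNLP name finder training format, i.e. one tokenized sentence per line with each entity mention wrapped in <START:type> ... <END> tags. An illustrative, made-up line for the location corpus would look like:

    Il habite à <START:location> Lyon <END> depuis 2005 .

Data in this shape can be fed directly to the OpenNLP name finder trainer.]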

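[For anyone who wants to try the posted models right away, here is a minimal sketch using the OpenNLP 1.5.x Java API. The class name, the model path, and the French sentence are placeholders for illustration only, not taken from the corpus:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class FrLocationTagger {
        public static void main(String[] args) throws Exception {
            // Load the binary model downloaded from the URL above.
            InputStream modelIn = new FileInputStream("fr-ner-location.bin");
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            modelIn.close();

            NameFinderME finder = new NameFinderME(model);

            // The name finder expects pre-tokenized sentences (illustrative example).
            String[] tokens = {"Elle", "travaille", "à", "Lyon", "depuis", "2005", "."};
            Span[] spans = finder.find(tokens);

            // Print each detected mention with its type (e.g. "location: Lyon").
            for (Span span : spans) {
                StringBuilder mention = new StringBuilder();
                for (int i = span.getStart(); i < span.getEnd(); i++) {
                    mention.append(tokens[i]).append(" ");
                }
                System.out.println(span.getType() + ": " + mention.toString().trim());
            }

            // Clear document-level adaptive features before starting a new document.
            finder.clearAdaptiveData();
        }
    }

The same models should also be usable interactively from the command line with the TokenNameFinder tool shipped with the 1.5.x CLI, which reads one tokenized sentence per line from stdin.]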