Hi,

Interesting! I'll definitely have a closer look at this and see if/how pignlproc could be a good match for Behemoth (https://github.com/jnioche/behemoth). Speaking of which, I'll probably write an OpenNLP wrapper for Behemoth at some point. Feel free to get in touch if this is of interest.
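To make the wrapper idea a bit more concrete, here is a rough sketch of what such an OpenNLP NER mapper for Behemoth could look like. The BehemothDocument / Annotation API below is assumed from memory of the repo (package, class and method names may not match exactly), and the model path is purely illustrative:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.Span;

    // Assumed Behemoth types -- check the actual package and signatures in the repo.
    import com.digitalpebble.behemoth.Annotation;
    import com.digitalpebble.behemoth.BehemothDocument;

    public class OpenNLPNameFinderMapper
        extends Mapper<Text, BehemothDocument, Text, BehemothDocument> {

      private NameFinderME finder;

      @Override
      protected void setup(Context context) throws IOException {
        // Load the NER model once per mapper; in a real job the file would
        // be shipped via the distributed cache rather than the classpath.
        InputStream in = getClass().getResourceAsStream("/fr-ner-location.bin");
        finder = new NameFinderME(new TokenNameFinderModel(in));
        in.close();
      }

      @Override
      protected void map(Text key, BehemothDocument doc, Context context)
          throws IOException, InterruptedException {
        String text = doc.getText();
        if (text != null) {
          // tokenizePos keeps character offsets so the annotations can point
          // back into the original document text.
          Span[] tokenSpans = WhitespaceTokenizer.INSTANCE.tokenizePos(text);
          String[] tokens = Span.spansToStrings(tokenSpans, text);
          for (Span name : finder.find(tokens)) {
            Annotation annot = new Annotation();
            annot.setType("Location");
            annot.setStart(tokenSpans[name.getStart()].getStart());
            annot.setEnd(tokenSpans[name.getEnd() - 1].getEnd());
            doc.getAnnotations().add(annot);
          }
          // Forget document-level adaptive features before the next document.
          finder.clearAdaptiveData();
        }
        context.write(key, doc);
      }
    }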
Oh, and congrats on the Apache Incubation!

Julien

On 4 January 2011 18:04, Olivier Grisel <[email protected]> wrote:
> Hi all,
>
> I have lately been working on a utility to automatically extract
> annotated multilingual corpora for the Named Entity Recognition task
> from Wikipedia dumps.
>
> The tool is named pignlproc, is licensed under ASL2, is available at
> https://github.com/ogrisel/pignlproc and uses Apache Hadoop, Apache
> Pig and Apache Whirr to perform the processing on a cluster of tens
> of virtual machines on the Amazon EC2 cloud infrastructure (you can
> of course also run it locally on a single machine).
>
> Here is a sample of the output on the French Wikipedia dump:
>
> http://pignlproc.s3.amazonaws.com/corpus/fr/opennlp_location/part-r-00000
>
> You can replace "location" with "person" or "organization" in the
> previous URL for more examples. You can also replace "part-r-00000"
> with "part-r-000XX" to download larger chunks of the corpus.
>
> And here are some trained models (50 iterations on the first 3 chunks
> of each corpus, i.e. ~100k annotated sentences for each type):
>
> http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-location.bin
> http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-person.bin
> http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-organization.bin
>
> It is possible to retrain those models on a larger subset of chunks
> by allocating more than 2GB of heap space to the OpenNLP CLI tool
> (I used version 1.5.0).
>
> The corpus is quite noisy, so the performance of the trained models
> is not optimal (but better than nothing anyway). Here are the results
> of evaluations on held-out chunks of the corpus (+/- 0.02):
>
> - Location:
>
>   Precision: 0.87
>   Recall:    0.74
>   F-Measure: 0.80
>
> - Person:
>
>   Precision: 0.80
>   Recall:    0.68
>   F-Measure: 0.74
>
> - Organization:
>
>   Precision: 0.80
>   Recall:    0.65
>   F-Measure: 0.72
>
> If you would like to build new models for new entity types (based on
> the DBpedia ontology) or for other languages, you can find some
> documentation on how to fetch the data and set up a Hadoop / EC2
> cluster here:
>
> https://github.com/ogrisel/pignlproc/blob/master/README.md
> https://github.com/ogrisel/pignlproc/wiki
>
> The Pig scripts used to build these models are rather short and
> simple to understand:
>
> https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/01_extract_sentences_with_links.pig
> https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig
> https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/03_join_by_type_and_convert.pig
>
> I plan to give more details in a blog post soon (tm).
>
> As always, any feedback is warmly welcome. If you think these Pig
> utilities would blend into the OpenNLP project, both my employer
> (Nuxeo) and I would be glad to contribute them to the project at
> the ASF.
>
> Cheers,
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
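For anyone who wants to try the published models quickly, here is a minimal standalone sketch using the OpenNLP 1.5 Java API (the local file name and example sentence are illustrative):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class TryFrenchNerModel {
      public static void main(String[] args) throws Exception {
        // Load the location model downloaded from the URL above.
        InputStream in = new FileInputStream("fr-ner-location.bin");
        TokenNameFinderModel model = new TokenNameFinderModel(in);
        in.close();

        NameFinderME finder = new NameFinderME(model);

        // The finder expects pre-tokenized input; a plain whitespace split
        // is enough for this illustrative sentence, but a real pipeline
        // would use an OpenNLP tokenizer.
        String[] tokens = "Olivier habite à Paris depuis 2005 .".split(" ");

        for (Span span : finder.find(tokens)) {
          StringBuilder name = new StringBuilder();
          for (int i = span.getStart(); i < span.getEnd(); i++) {
            name.append(tokens[i]).append(' ');
          }
          System.out.println(span.getType() + ": " + name.toString().trim());
        }

        // Reset adaptive data when moving on to another document.
        finder.clearAdaptiveData();
      }
    }

Note that find() returns spans expressed in token indices, so mapping a detected name back to its surface form is just a matter of joining the covered tokens, as done above.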
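As a side note for readers less familiar with the metrics: the F-measure reported above is the harmonic mean of precision and recall, F = 2 * P * R / (P + R). For the location model this gives 2 * 0.87 * 0.74 / (0.87 + 0.74) ≈ 0.80, matching the reported figure.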
