Hi all,

I have recently been working on a utility to automatically extract
annotated multilingual corpora for the Named Entity Recognition task
from Wikipedia dumps.

The tool, named pignlproc, is licensed under ASL 2.0 and available at
https://github.com/ogrisel/pignlproc. It uses Apache Hadoop, Apache
Pig and Apache Whirr to perform the processing on a cluster of tens of
virtual machines on the Amazon EC2 cloud infrastructure (you can of
course also run it locally on a single machine).

Here is a sample of the output on the French Wikipedia dump:

  http://pignlproc.s3.amazonaws.com/corpus/fr/opennlp_location/part-r-00000

You can replace "location" by "person" or "organization" in the
previous URL for more examples. You can also replace "part-r-00000" by
"part-r-000XX" to download larger chunks of the corpus.
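Assuming the chunks use the standard OpenNLP name finder training
format (one sentence per line, entity spans marked with
<START:type> ... <END> tags), a minimal Python sketch can pull the
annotated spans out of a line; the sample sentence below is made up
for illustration:

```python
import re

# Matches spans in the OpenNLP name finder training format,
# e.g. "<START:location> Paris <END>".
SPAN_RE = re.compile(r"<START:(\w+)>\s*(.*?)\s*<END>")

def extract_entities(line):
    """Return (entity type, surface form) pairs for each annotated span."""
    return SPAN_RE.findall(line)

sentence = ("Victor Hugo est né à <START:location> Besançon <END> "
            "en 1802 .")
print(extract_entities(sentence))  # → [('location', 'Besançon')]
```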

And here are some trained models (50 iterations on the first 3 chunks
of each corpus, i.e. ~100k annotated sentences for each type):

http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-location.bin
http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-person.bin
http://pignlproc.s3.amazonaws.com/models/opennlp/fr-ner-organization.bin

These models can be retrained on a larger subset of chunks by
allocating more than 2GB of heap space to the OpenNLP CLI tool (I
used version 1.5.0).

The corpus is quite noisy, so the performance of the trained models is
not optimal (but better than nothing anyway). Here are the results of
evaluations on held-out chunks of the corpus (+/- 0.02):

- Location:

Precision: 0.87
Recall: 0.74
F-Measure: 0.80

- Person:

Precision: 0.80
Recall: 0.68
F-Measure: 0.74

- Organization:

Precision: 0.80
Recall: 0.65
F-Measure: 0.72
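For reference, the F-measure reported above is the usual harmonic mean
of precision and recall; a quick sketch to check the figures for
consistency:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1 score)."""
    return 2 * precision * recall / (precision + recall)

# The reported per-type scores, rounded to two decimals:
for name, p, r in [("location", 0.87, 0.74),
                   ("person", 0.80, 0.68),
                   ("organization", 0.80, 0.65)]:
    print(name, round(f_measure(p, r), 2))
```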

If you would like to build new models for new entity types (based on
the DBpedia ontology) or other languages, you can find documentation
on how to fetch the data and set up a Hadoop / EC2 cluster here:

  https://github.com/ogrisel/pignlproc/blob/master/README.md
  https://github.com/ogrisel/pignlproc/wiki

The Pig scripts used to build these models are rather short and simple
to understand:

  https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/01_extract_sentences_with_links.pig
  https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig
  https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/03_join_by_type_and_convert.pig
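In outline: the first script keeps sentences whose wiki links serve as
candidate entity spans, the second maps each link target to a DBpedia
type, and the third joins the two and emits OpenNLP-formatted
sentences. A rough Python sketch of that final join on toy stand-in
data (the real work happens in the Pig scripts above):

```python
# Toy stand-ins for the two relations joined by the third script:
# sentences with their wiki-linked spans, and a DBpedia type per article.
sentences = [
    ("Victor Hugo est né à Besançon en 1802 .",
     [("Besançon", "Besançon")]),  # (surface form, link target)
]
article_types = {"Besançon": "location"}

def convert(sentences, article_types, wanted_type):
    """Join link targets with DBpedia types; emit OpenNLP-style lines."""
    for text, links in sentences:
        out = text
        for surface, target in links:
            if article_types.get(target) == wanted_type:
                out = out.replace(
                    surface,
                    "<START:%s> %s <END>" % (wanted_type, surface))
        yield out

for line in convert(sentences, article_types, "location"):
    print(line)
```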

I plan to give more details in a blog post soon (tm).

As always, any feedback is warmly welcome. If you think these Pig
utilities would blend into the OpenNLP project, both my employer
(Nuxeo) and I would be glad to contribute them to the ASF.

Cheers,

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
