On 10/10/2013 11:58 AM, Thomas Zastrow wrote:
Hello,

There seems to be no free German NE model available, so I started to think about creating one - just using free resources like Wikipedia etc.

I still have some questions:

Somewhere in the documentation, I read about a dictionary-driven NE recognizer in OpenNLP, but I didn't find any further information about it. Anyway, would it be possible to combine the statistical approach with dictionaries? For example, having a list of country names would be useful.


Yes, that is possible. We have a DictionaryFeatureGenerator which can look up names in a dictionary and produce features for them. You can create an XML file that describes how the feature generation should be set up for training. That file is then stored in the model, so the exact same feature generation can be reproduced when the model is loaded later.

See our documentation:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.featuregen

What features would you like to generate via the dictionary?

The Name Finder can be extended with custom feature generators, in case you have some ideas or just want to experiment a bit.
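
To make the dictionary idea a bit more concrete, here is a minimal plain-Java sketch of what a dictionary-based feature amounts to: each token is looked up in a word list, and a hit adds one extra feature for the statistical model to weigh. This is not OpenNLP's DictionaryFeatureGenerator; the class name, the feature strings and the single-token, lowercased lookup are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of OpenNLP.
class DictionaryFeatureSketch {

    // Returns one feature list per token. Tokens found in the dictionary
    // get an extra "<prefix>:indict" feature (feature names are made up here).
    static List<List<String>> features(String[] tokens, Set<String> dictionary, String prefix) {
        List<List<String>> all = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            List<String> feats = new ArrayList<>();
            feats.add("w=" + lower);
            if (dictionary.contains(lower)) {
                feats.add(prefix + ":indict");
            }
            all.add(feats);
        }
        return all;
    }

    public static void main(String[] args) {
        // Toy country list; a real one could be extracted from Wikipedia.
        Set<String> countries = new HashSet<>(Arrays.asList("deutschland", "frankreich"));
        String[] sentence = {"Er", "reiste", "nach", "Frankreich", "."};
        List<List<String>> feats = features(sentence, countries, "country");
        for (int i = 0; i < sentence.length; i++) {
            System.out.println(sentence[i] + " -> " + feats.get(i));
        }
    }
}
```

The sketch only handles single-token entries; multi-word names such as "Vereinigte Staaten" would need a lookup over token sequences.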

As far as I understood, the name finder is at the moment only stable for one property, like person names. I would like to have the traditional division into persons, locations, organizations and misc. When manually creating the training data, would it be OK to add all four kinds to the text right away and then maybe later create 4 models for the different properties?

By default, the name finder trainer trains a model for all name types occurring in the training data. The -nameTypes option can restrict the types used
to one or several of them. I often use this, and it works great.
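
As a rough illustration of what annotating all four kinds in one file and then restricting to a subset amounts to, here is a plain-Java sketch. It assumes the usual name finder training format with `<START:type> ... <END>` markup and simply strips the markup of the types that are not wanted. It is not the trainer's actual code; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of OpenNLP.
class NameTypeFilterSketch {

    // Matches one annotated span, e.g. "<START:person> Angela Merkel <END>".
    private static final Pattern SPAN = Pattern.compile("<START:([^>]+)> (.*?) <END>");

    // Keeps the markup for the given types and reduces all other spans
    // to their plain tokens.
    static String keepTypes(String line, Set<String> keep) {
        Matcher m = SPAN.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = keep.contains(m.group(1)) ? m.group() : m.group(2);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<START:person> Angela Merkel <END> besuchte <START:location> Berlin <END> .";
        Set<String> keep = new HashSet<>(Arrays.asList("person"));
        System.out.println(keepTypes(line, keep));
    }
}
```

So a single annotated corpus can serve both a combined model and per-type models, without maintaining four separate copies of the text.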

The name finder takes sentences and tokens as input. Would it be OK to also have POS tags assigned to the training data? That would make it much easier to manually annotate the data when e.g. NEs are already marked by the POS tagger.


Passing in POS tags is currently not supported by our API. The easiest way to get around that limitation is probably
to run the POS tagger as part of the name finder feature generation.
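
The workaround could look roughly like the following plain-Java sketch: the feature generation step tags the sentence itself and emits the tags as features, so the name finder input stays sentences and tokens. The tagger here is a toy stand-in passed in as a function, and all names and feature strings are hypothetical; a real version would wrap an actual POS tagger inside a custom feature generator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration class; not part of OpenNLP.
class PosFeatureSketch {

    // Toy stand-in for a real POS tagger: capitalized tokens get "N", all others "X".
    static String[] toyTag(String[] tokens) {
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            boolean capitalized = !tokens[i].isEmpty() && Character.isUpperCase(tokens[i].charAt(0));
            tags[i] = capitalized ? "N" : "X";
        }
        return tags;
    }

    // Tags the sentence once, then emits token, tag and previous-tag features.
    static List<List<String>> features(String[] tokens, Function<String[], String[]> tagger) {
        String[] tags = tagger.apply(tokens);
        List<List<String>> all = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            List<String> feats = new ArrayList<>();
            feats.add("w=" + tokens[i]);
            feats.add("pos=" + tags[i]);
            if (i > 0) {
                feats.add("prevpos=" + tags[i - 1]);
            }
            all.add(feats);
        }
        return all;
    }

    public static void main(String[] args) {
        String[] sentence = {"Angela", "lebt", "in", "Berlin"};
        List<List<String>> feats = features(sentence, PosFeatureSketch::toyTag);
        for (int i = 0; i < sentence.length; i++) {
            System.out.println(sentence[i] + " -> " + feats.get(i));
        }
    }
}
```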

There is German CONLL training data you could use to train a name finder model:
http://www.cnts.ua.ac.be/conll2003/ner/

The OpenNLP Name Finder can be directly trained on the CONLL2003 data.

HTH,
Jörn
