On 10/10/2013 11:58 AM, Thomas Zastrow wrote:
Hello,
There seems to be no free German NE model available, so I started to
think about creating one - just using free resources like Wikipedia etc.
I still have some questions:
Somewhere in the documentation, I read about a dictionary-driven NE
recognizer in OpenNLP, but I didn't find any further information
about it. Anyway, would it be possible to combine the statistical
approach with dictionaries? For example, having a list of country
names would be useful.
Yes, that is possible. We have a DictionaryFeatureGenerator which can
look up names in a dictionary and produce features for them.
There is an XML file you can create to describe how the feature
generation should be set up for training. The file is then stored in the
model, so that the exact same feature generation can be reproduced when
the model is loaded later.
See our documentation:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.featuregen
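As a rough sketch, a descriptor that adds a dictionary lookup to a few of
the standard generators might look like the one below. The resource key
"countries" and the prefix "country" are made-up values for illustration;
the key has to match a dictionary resource that is made available at
training time. Please check the element and attribute names against the
manual section linked above before relying on them.

```xml
<generators>
  <cache>
    <generators>
      <window prevLength="2" nextLength="2">
        <tokenclass/>
      </window>
      <window prevLength="2" nextLength="2">
        <token/>
      </window>
      <definition/>
      <prevmap/>
      <bigram/>
      <sentence begin="true" end="false"/>
      <!-- "countries" and "country" are placeholder values (assumptions) -->
      <dictionary dict="countries" prefix="country"/>
    </generators>
  </cache>
</generators>
```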
What are the features you would like to generate via the dictionary?
The Name Finder can be extended with custom feature generators, in case
you have some ideas or just want to experiment a bit.
As far as I understood, the name finder is at the moment only stable
for one property, like person names. I would like to have the
traditional division into persons, locations, organizations and misc.
When manually creating the training data, would it be OK to add all
four kinds to the text right away and then maybe create 4 models
for the different properties later?
By default, the name finder trainer trains a model for all name types
occurring in the training data. The -nameTypes option can restrict
training to one or more of those types. I often use this, and it works
great.
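To make the effect of restricting the types concrete, here is a small
stand-alone sketch. It is not OpenNLP code. It only mimics, on a single
line in the name finder training format (one sentence per line, names
wrapped in <START:type> ... <END>), what keeping a subset of the types
amounts to. The class name, the method name and the German sample sentence
are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: a hypothetical helper, not part of OpenNLP.
class NameTypeFilter {

    /**
     * Keeps the <START:type> ... <END> markup for the given types and
     * drops the markup (but not the tokens) of all other types.
     */
    static String keepTypes(String line, Set<String> typesToKeep) {
        List<String> out = new ArrayList<>();
        boolean insideKeptSpan = false;
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("<START:") && token.endsWith(">")) {
                String type = token.substring("<START:".length(), token.length() - 1);
                insideKeptSpan = typesToKeep.contains(type);
                if (insideKeptSpan) {
                    out.add(token);
                }
            } else if (token.equals("<END>")) {
                if (insideKeptSpan) {
                    out.add(token);
                }
                insideKeptSpan = false;
            } else {
                out.add(token); // ordinary tokens are always kept
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String line = "<START:person> Angela Merkel <END> besuchte "
                + "<START:location> Paris <END> .";
        Set<String> personOnly = new HashSet<>(Arrays.asList("person"));
        System.out.println(keepTypes(line, personOnly));
    }
}
```

So annotating all four kinds in one corpus costs nothing: the same file
can later be used for one combined model or for one model per type.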
The name finder takes sentences and tokens as input. Would it be OK to
also have POS tags assigned to the training data? That would make it
much easier to manually annotate the data when e.g. NEs are already
marked by the POS tagger.
Passing in POS tags is currently not supported by our API. The easiest
way to get around that limitation is probably
to run the POS tagger as part of the name finder feature generation.
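The idea can be sketched roughly as follows. This is a simplified,
stand-alone illustration and deliberately does not use the OpenNLP
classes: the Tagger interface is a hypothetical stand-in for a POS tagger,
and createFeatures only resembles the shape of a feature generator
callback. A real custom generator would implement OpenNLP's feature
generator interface and call the actual POS tagger instead.

```java
import java.util.List;

// Illustration only: hypothetical names, not the OpenNLP API.
class PosFeatureSketch {

    /** Hypothetical stand-in for a POS tagger: returns one tag per token. */
    interface Tagger {
        String[] tag(String[] tokens);
    }

    private final Tagger tagger;
    private String[] cachedTokens; // sentence the cached tags belong to
    private String[] cachedTags;

    PosFeatureSketch(Tagger tagger) {
        this.tagger = tagger;
    }

    /** Adds a POS feature for the token at index and for its left neighbour. */
    void createFeatures(List<String> features, String[] tokens, int index) {
        // The generator is called once per token, so tag each sentence
        // only once and reuse the tags while the same array is passed in.
        if (cachedTokens != tokens) {
            cachedTags = tagger.tag(tokens);
            cachedTokens = tokens;
        }
        features.add("pos=" + cachedTags[index]);
        if (index > 0) {
            features.add("prevpos=" + cachedTags[index - 1]);
        }
    }
}
```

With this approach the training data itself stays free of POS tags. The
tags are computed on the fly, both during training and at runtime.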
There is German CONLL training data you could use to train a name finder
model:
http://www.cnts.ua.ac.be/conll2003/ner/
The OpenNLP Name Finder can be directly trained on the CONLL2003 data.
HTH,
Jörn