On 11/07/2013 10:58 AM, Jens Grivolla wrote:
I don't know specifically about NameFinderME, but with other
statistical NER systems I noticed that they tend to give a lot of
weight to the fact that a word has initial capitalization when making
the decision, often so much that it is the only feature that matters.
This is due to the fact that on cleanly written text (e.g. news
articles) this is an extremely reliable predictor. If you have other
kinds of text such as UGC (e.g. Twitter) you need to train a model
using this kind of data and hope for the best. Accuracy will usually
be far below what is achieved on news articles.
Exactly. It is mostly a question of the training data: the English
SourceForge models are trained on news articles from the 90s, and these
don't contain lowercased or all-uppercase names.
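
To make the point concrete, here is a toy sketch in Python. It is not how
NameFinderME works internally; the function names and example sentences are
made up for illustration. It mimics a model that relies on nothing but the
initial-capitalization feature and shows what happens on lowercased or
all-uppercase input.

```python
def init_cap_feature(token):
    """Hypothetical feature: True if the token starts with an uppercase letter."""
    return token[:1].isupper()


def tag_by_capitalization(tokens):
    # Toy "model": every capitalized token that is not sentence-initial
    # is treated as a name candidate. No other feature is used.
    return [tok for i, tok in enumerate(tokens) if i > 0 and init_cap_feature(tok)]


# Made-up example sentences: clean news style, lowercased, and all uppercase.
news = "Yesterday Angela Merkel met Barack Obama in Berlin".split()
tweet = [t.lower() for t in news]
shout = [t.upper() for t in news]

print(tag_by_capitalization(news))   # the five name tokens
print(tag_by_capitalization(tweet))  # nothing is found
print(tag_by_capitalization(shout))  # every non-initial token is flagged
```

On the news-style sentence the feature alone picks out the names; on the
lowercased version it finds nothing, and on the all-uppercase version it
flags almost every token. A model trained only on clean news text has never
seen the last two cases.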
Jörn