On 07/17/2012 10:27 AM, Chi Dat Nguyen wrote:
After a while I figured out that the result provided by the pretrained
tokenizer causes this problem.
If "Mr. Vinken" is tokenized into 3 tokens "Mr", ".", "Vinken",
instead of 2 tokens, the Name Finder works perfectly.
It seems that the SimpleTokenizer is better than the pretrained
tokenizer in these cases.

Exactly, the English NER models on the SourceForge page are trained with
the SimpleTokenizer, so you need to use it to get good results.
Important context words such as "Mr." in particular are tokenized
differently than by the maxent-based English tokenizer.
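
To illustrate why the token boundaries differ, here is a rough sketch of
character-class tokenization in plain Java. It is not OpenNLP's code; the
class name CharClassTokenizerSketch and the exact splitting rules are
assumptions made for illustration. It only shows how splitting wherever the
character class changes turns "Mr. Vinken" into three tokens. In OpenNLP
itself you would call SimpleTokenizer.INSTANCE.tokenize(...) instead.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a simplified character-class tokenizer, NOT OpenNLP's
// implementation. Class name and rules are assumptions for this sketch.
final class CharClassTokenizerSketch {

    private enum CharClass { WHITESPACE, LETTER, DIGIT, OTHER }

    private static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    // Splits on whitespace and wherever the character class changes.
    // Every OTHER character (punctuation etc.) becomes its own token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the current token, -1 if none
        CharClass previous = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            CharClass current = classify(text.charAt(i));
            if (current == CharClass.WHITESPACE) {
                if (start != -1) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start == -1) {
                start = i;
            } else if (current != previous || current == CharClass.OTHER) {
                tokens.add(text.substring(start, i));
                start = i;
            }
            previous = current;
        }
        if (start != -1) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Mr, ., Vinken]
        System.out.println(tokenize("Mr. Vinken"));
    }
}
```

A tokenizer that keeps the period attached would produce "Mr." as one token,
which the name finder models never saw during training.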

May I ask how we can use the optional parameters of
opennlp.uima.namefind.NameFinder: opennlp.uima.ProbabilityFeature,
opennlp.uima.BeamSize, opennlp.uima.DocumentConfidenceType?
I'm sorry for asking these kinds of questions. I just started to use
OpenNLP recently and there is nearly no documentation for OpenNLP UIMA
at all.

These are parameters of our UIMA integration. Do you use that?

You need to specify these parameters in the Analysis Engine descriptor
and assign an appropriate value to each. The beam size takes an integer.
The probability feature is the name of the feature to which the
probability (i.e. the confidence) of a name is assigned. The
DocumentConfidenceType is the type of a feature structure (FS) that is
created to hold the confidence for a document.
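
As a rough sketch, the relevant part of an Analysis Engine descriptor could
look like the fragment below. The parameter names are the ones from your
question; all values (the beam size 5, the feature name "prob" and the type
name "example.DocumentConfidence") are placeholders I made up, and the
declared types are my assumption based on the description above. Both
elements belong inside analysisEngineMetaData.

```xml
<!-- Sketch only: every value below is a hypothetical placeholder. -->
<configurationParameters>
  <configurationParameter>
    <name>opennlp.uima.BeamSize</name>
    <type>Integer</type>
    <multiValued>false</multiValued>
    <mandatory>false</mandatory>
  </configurationParameter>
  <configurationParameter>
    <name>opennlp.uima.ProbabilityFeature</name>
    <type>String</type>
    <multiValued>false</multiValued>
    <mandatory>false</mandatory>
  </configurationParameter>
  <configurationParameter>
    <name>opennlp.uima.DocumentConfidenceType</name>
    <type>String</type>
    <multiValued>false</multiValued>
    <mandatory>false</mandatory>
  </configurationParameter>
</configurationParameters>

<configurationParameterSettings>
  <nameValuePair>
    <name>opennlp.uima.BeamSize</name>
    <value><integer>5</integer></value>
  </nameValuePair>
  <nameValuePair>
    <!-- Feature on your name annotation type that receives the probability -->
    <name>opennlp.uima.ProbabilityFeature</name>
    <value><string>prob</string></value>
  </nameValuePair>
  <nameValuePair>
    <!-- Type from your own type system that holds the document confidence -->
    <name>opennlp.uima.DocumentConfidenceType</name>
    <value><string>example.DocumentConfidence</string></value>
  </nameValuePair>
</configurationParameterSettings>
```

The feature and the type named here would also have to exist in the type
system used by your pipeline.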

Jörn
