Named Entity detection engine should filter out some obviously wrong text
annotations
-------------------------------------------------------------------------------------
Key: STANBOL-320
URL: https://issues.apache.org/jira/browse/STANBOL-320
Project: Stanbol
Issue Type: Bug
Reporter: Olivier Grisel
Assignee: Olivier Grisel
OpenNLP tends to return really weird results from time to time. For instance:
"The researchers found the liver expresses higher levels of the gene encoding
"selenoprotein P" (SEPP1) in people with type 2 diabetes - those with more
insulin resistance." outputs a Person TextAnnotation for the mention 'P "'.
Note the double quote that is included as part of the mention, and the
additional whitespace separator, probably inserted by a confused detokenizer.
Here is another example:
"We are all very excited for Rahm as he takes on a new challenge for which he
is extraordinarily well qualified," said the president. Obama appointed
political consultant and senior advisor Pete Rouse as interim chief, calling
Rouse "a skillful problem-solver" and a "wise, skillful and long-time
counselor." => outputs 'Rouse "' as a Person annotation as well. Again, the
error comes from bad handling of quotation marks.
I would like to use this JIRA issue to collect the most common annotation
mistakes that could be filtered using ad-hoc Java code directly inside the
enhancement engine.
For the two previous cases, removing the quotation marks and filtering out
single-letter names should be enough. There might be other cases that don't
match this simple pattern, though.
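
A minimal sketch of what such a filter could look like, in plain Java. The
class name MentionFilter, the method clean and the set of characters treated as
quotation marks are all assumptions made for illustration; this is not existing
Stanbol or OpenNLP code, and it does not show how the start/end offsets of the
TextAnnotation would be adjusted after trimming.

```java
import java.util.Optional;

/** Hypothetical helper: cleans up a mention returned by the NER model. */
public final class MentionFilter {

    // Assumed set of quotation marks to strip: straight double quote,
    // curly double quotes and guillemets. Apostrophes are left alone so
    // that names such as "O'Brien" are not damaged.
    private static final String QUOTES = "\"\u201C\u201D\u00AB\u00BB";

    private MentionFilter() {
    }

    /**
     * Strips quotation marks and whitespace from both ends of the mention.
     * Returns an empty Optional when the mention should be discarded,
     * i.e. when fewer than two characters remain (single-letter names).
     */
    public static Optional<String> clean(String mention) {
        if (mention == null) {
            return Optional.empty();
        }
        int start = 0;
        int end = mention.length();
        while (start < end && isQuoteOrSpace(mention.charAt(start))) {
            start++;
        }
        while (end > start && isQuoteOrSpace(mention.charAt(end - 1))) {
            end--;
        }
        String cleaned = mention.substring(start, end);
        if (cleaned.length() < 2) {
            return Optional.empty();
        }
        return Optional.of(cleaned);
    }

    private static boolean isQuoteOrSpace(char c) {
        return Character.isWhitespace(c) || QUOTES.indexOf(c) >= 0;
    }
}
```

With this sketch, the mention 'P "' from the first example would be discarded,
while 'Rouse "' from the second example would be kept as 'Rouse'. Only the
edges of the mention are trimmed, so quotation marks inside a mention are not
handled.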