Named Entity detection engine should filter out some obviously wrong text 
annotations
-------------------------------------------------------------------------------------

                 Key: STANBOL-320
                 URL: https://issues.apache.org/jira/browse/STANBOL-320
             Project: Stanbol
          Issue Type: Bug
            Reporter: Olivier Grisel
            Assignee: Olivier Grisel


OpenNLP tends to return really weird results from time to time. For instance:

"The researchers found the liver expresses higher levels of the gene encoding 
"selenoprotein P" (SEPP1) in people with type 2 diabetes - those with more 
insulin resistance." outputs a Person TextAnnotation for the mention 'P "' => 
note the double quote that is included as part of the mention and the 
additional whitespace separator, probably inserted by a confused detokenizer.

Here is another example:

"We are all very excited for Rahm as he takes on a new challenge for which he 
is extraordinarily well qualified," said the president. Obama appointed 
political consultant and senior advisor Pete Rouse as interim chief, calling 
Rouse "a skillful problem-solver" and a "wise, skillful and long-time 
counselor." => outputs 'Rouse "' as a Person annotation as well. Again, the 
error comes from bad handling of quotation marks.

I would like to use this JIRA issue to collect the most common annotation 
mistakes that could be filtered using ad-hoc Java code directly inside the 
enhancement engine.

For the two previous cases, removing the quotation marks and filtering 
single-letter names should be enough. There might be other cases that don't 
match this simple pattern, though.
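
A minimal sketch of such a filter, to make the proposal concrete. It is not 
the engine's actual code: the class name MentionFilter, the method clean() and 
the exact set of characters treated as quotation marks are all assumptions 
made for illustration. It trims quotation marks and whitespace from both ends 
of a mention and rejects whatever is left if it is shorter than two characters.

```java
import java.util.Optional;

// Hypothetical helper, not part of Stanbol or OpenNLP.
final class MentionFilter {

    // Assumed set of quotation marks: ASCII quotes plus common typographic ones.
    private static final String QUOTES = "\"'\u201C\u201D\u2018\u2019";

    private MentionFilter() {}

    /**
     * Trims quotation marks and whitespace from both ends of a mention.
     * Returns an empty Optional when the mention should be discarded,
     * i.e. when it is null or fewer than two characters remain.
     */
    static Optional<String> clean(String mention) {
        if (mention == null) {
            return Optional.empty();
        }
        int start = 0;
        int end = mention.length();
        // Skip quotes and whitespace at the beginning of the mention.
        while (start < end && isNoise(mention.charAt(start))) {
            start++;
        }
        // Skip quotes and whitespace at the end of the mention.
        while (end > start && isNoise(mention.charAt(end - 1))) {
            end--;
        }
        String cleaned = mention.substring(start, end);
        // Single-letter (or empty) names are treated as detection errors.
        if (cleaned.length() < 2) {
            return Optional.empty();
        }
        return Optional.of(cleaned);
    }

    private static boolean isNoise(char c) {
        return Character.isWhitespace(c) || QUOTES.indexOf(c) >= 0;
    }
}
```

With this sketch, the mention 'P "' is dropped and 'Rouse "' becomes 'Rouse'. 
Only the ends of the mention are trimmed, so an apostrophe inside a name is 
left alone. A real engine would also need to adjust the start and end offsets 
of the TextAnnotation to match the trimmed text.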


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
