[
https://issues.apache.org/jira/browse/STANBOL-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096102#comment-13096102
]
Olivier Grisel commented on STANBOL-320:
----------------------------------------
Another example, this time the title page of a scientific paper:
Semantic Relation Extraction With Kernels Over Typed
Dependency Trees
Frank Reichartz
Hannes Korte
Gerhard Paass
Fraunhofer IAIS
Schloss Birlinghoven
St. Augustin, Germany
=> OpenNLP outputs a single annotation of type person: "Frank Reichartz Hannes
Korte Gerhard Paass Fraunhofer IAIS". In this case we could avoid such false
positives with a single rule that discards person names with more than 4 or 5
words, or more than 50 characters, for instance.
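A minimal sketch of such a rule, assuming the mention text is available as a
plain String. The class name, method name and the exact thresholds (5 words,
50 characters) are illustrative choices, not existing Stanbol or OpenNLP API:

```java
// Hypothetical helper, not part of Stanbol or OpenNLP.
final class PersonMentionLengthFilter {

    // Assumed thresholds, taken from the "4 or 5 words" / "50 chars" suggestion.
    static final int MAX_WORDS = 5;
    static final int MAX_CHARS = 50;

    // Returns true if the mention is short enough to be a plausible person name.
    static boolean isPlausiblePersonMention(String mention) {
        String trimmed = mention.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        // Count whitespace-separated words.
        int words = trimmed.split("\\s+").length;
        return words <= MAX_WORDS && trimmed.length() <= MAX_CHARS;
    }

    public static void main(String[] args) {
        String merged = "Frank Reichartz Hannes Korte Gerhard Paass Fraunhofer IAIS";
        System.out.println(isPlausiblePersonMention(merged));          // false
        System.out.println(isPlausiblePersonMention("Gerhard Paass")); // true
    }
}
```

The merged mention above has 8 words and 58 characters, so either threshold
alone would be enough to discard it.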
> Named Entity detection engine should filter out some obviously wrong text
> annotations
> -------------------------------------------------------------------------------------
>
> Key: STANBOL-320
> URL: https://issues.apache.org/jira/browse/STANBOL-320
> Project: Stanbol
> Issue Type: Bug
> Reporter: Olivier Grisel
> Assignee: Olivier Grisel
>
> OpenNLP tends to return really weird results from time to time. For instance:
> "The researchers found the liver expresses higher levels of the gene encoding
> "selenoprotein P" (SEPP1) in people with type 2 diabetes - those with more
> insulin resistance." outputs a Person TextAnnotation for the mention 'P "' =>
> note the double quote that is included as part of the mention and the additional
> whitespace separator probably inserted by a confused detokenizer.
> Here is another example:
> "We are all very excited for Rahm as he takes on a new challenge for which he
> is extraordinarily well qualified," said the president. Obama appointed
> political consultant and senior advisor Pete Rouse as interim chief, calling
> Rouse "a skillful problem-solver" and a "wise, skillful and long-time
> counselor." => outputs 'Rouse "' as a Person annotation as well. This is
> again a confusion caused by bad handling of quotation marks.
> I would like to use this JIRA issue to collect the most common annotation
> mistakes that could be filtered using ad-hoc Java code directly inside the
> enhancement engine.
> For the two previous cases, removing the quotation marks and filtering
> single-letter names should be enough. There might be other cases that don't
> match this simple pattern though.
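A rough sketch of the clean-up proposed in the issue description, assuming the
mention is handled as a plain String. The class and method names are
hypothetical and not existing Stanbol or OpenNLP API; only double quotes
(straight and curly) and whitespace at the edges of the mention are stripped:

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Stanbol or OpenNLP.
final class PersonMentionCleaner {

    // Whitespace and double quotes (straight or curly) at either edge of the mention.
    private static final Pattern EDGE_NOISE =
            Pattern.compile("^[\\s\"\u201C\u201D]+|[\\s\"\u201C\u201D]+$");

    // Strips edge quotes and whitespace; returns empty if at most one character is left.
    static Optional<String> clean(String mention) {
        String cleaned = EDGE_NOISE.matcher(mention).replaceAll("");
        if (cleaned.length() <= 1) {
            return Optional.empty(); // single-letter "names" are discarded
        }
        return Optional.of(cleaned);
    }

    public static void main(String[] args) {
        System.out.println(clean("P \""));     // Optional.empty
        System.out.println(clean("Rouse \"")); // Optional[Rouse]
    }
}
```

If a cleaned mention is kept, the start and end offsets of the corresponding
annotation would presumably need to be adjusted to match the shortened text.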