Rupert Westenthaler created STANBOL-1122:
--------------------------------------------
Summary: Only Tokens with a fully linked entity should be marked
as consumed
Key: STANBOL-1122
URL: https://issues.apache.org/jira/browse/STANBOL-1122
Project: Stanbol
Issue Type: Sub-task
Components: Enhancement Engines
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
The EntityLinking process marks Tokens that are already linked with an Entity
as "consumed".
Let's assume a text mentions:
"An airplane crashed in the northern part of the Democratic Republic of the
Congo"
If Proper Noun linking is activated, "Democratic" would be the first
"active" token within this sentence and ["Democratic", "Republic"] would be the
first "search tokens". Now let's assume that the vocabulary contains the Entity
"Democratic Republic of the Congo" and that it is returned by the
EntitySearcher for a query for ["Democratic", "Republic"].
So when the Entity "Democratic Republic of the Congo" is matched with the
sentence, all tokens up to and including "Congo" are marked as consumed. This
ensures that there are no further lookups for "Republic" or "Congo".
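The consumption behaviour described above can be sketched as follows. This is
only an illustration: the class, the Token type and all method names are
hypothetical and are not the Stanbol EntityLinking API, and the proper noun
flags are assumed to come from an earlier POS tagging step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: these types are hypothetical, not Stanbol classes. */
final class ConsumedTokensSketch {

    static final class Token {
        final String text;
        final boolean properNoun; // assumed to come from a POS tagger
        boolean consumed;         // set once a matched Entity covers this token

        Token(String text, boolean properNoun) {
            this.text = text;
            this.properNoun = properNoun;
        }
    }

    /** Splits on single spaces and flags the given words as proper nouns. */
    static List<Token> tokens(String sentence, String... properNouns) {
        Set<String> flagged = new HashSet<>(Arrays.asList(properNouns));
        List<Token> result = new ArrayList<>();
        for (String word : sentence.split(" ")) {
            result.add(new Token(word, flagged.contains(word)));
        }
        return result;
    }

    /** Index of the next unconsumed proper noun at or after 'from', or -1. */
    static int nextActiveToken(List<Token> tokens, int from) {
        for (int i = from; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.properNoun && !t.consumed) {
                return i;
            }
        }
        return -1;
    }

    /** Marks tokens start..end (inclusive) as consumed by a matched Entity. */
    static void markConsumed(List<Token> tokens, int start, int end) {
        for (int i = start; i <= end; i++) {
            tokens.get(i).consumed = true;
        }
    }
}
```

For the example sentence, with "Democratic", "Republic" and "Congo" flagged as
proper nouns, nextActiveToken first returns the position of "Democratic". Once
the span from "Democratic" to "Congo" is passed to markConsumed, no active
token remains, so "Republic" and "Congo" trigger no further lookups.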
While this is generally good for suggested Entities that exactly match the
text, it is dangerous for partial matches, as shown by the following example:
"President Barack Obama said the US estimated ..."
If you link this text to Freebase, then "Presidency of Barack Obama"
(https://www.freebase.com/m/05b6w1g) will get linked for the section "President
Barack Obama". The match is "partial", as only three of the four tokens of the
label match the text, and the inexact match of "Presidency" with "President"
further reduces the confidence to an overall score of about 0.6.
However, the current algorithm would still mark "Barack" and "Obama" as
consumed and therefore prevent "Barack Obama" from being linked for this
mention.
This issue will change this so that only FULL matches (where all tokens in the
label match tokens in the text) will mark Tokens in the text as consumed.
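A minimal sketch of the proposed rule is shown below. It is not the actual
Stanbol implementation: the class and method names are hypothetical, and
tokens are compared with a simple case-insensitive equality check, whereas the
real engine also accepts inexact token matches such as "Presidency" and
"President".

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hypothetical names, not the Stanbol EntityLinking API. */
final class FullMatchConsumptionSketch {

    /** Naive whitespace tokenizer, sufficient for this example. */
    static List<String> split(String text) {
        return Arrays.asList(text.split(" "));
    }

    /**
     * A match is FULL when every token of the Entity label is found among the
     * tokens of the matched text span (simplified: exact, ignoring case).
     */
    static boolean isFullMatch(List<String> labelTokens, List<String> spanTokens) {
        if (labelTokens.isEmpty()) {
            return false;
        }
        for (String label : labelTokens) {
            boolean found = false;
            for (String text : spanTokens) {
                if (label.equalsIgnoreCase(text)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    /**
     * Marks text tokens start..end (inclusive) as consumed only if the label
     * fully matches that span. Returns true when the tokens were consumed.
     */
    static boolean consumeIfFullMatch(boolean[] consumed, List<String> textTokens,
            int start, int end, List<String> labelTokens) {
        List<String> span = textTokens.subList(start, end + 1);
        if (!isFullMatch(labelTokens, span)) {
            return false; // partial match: leave the tokens available
        }
        for (int i = start; i <= end; i++) {
            consumed[i] = true;
        }
        return true;
    }
}
```

With the text "President Barack Obama said the US estimated", the label
"Presidency of Barack Obama" is only a partial match for the first three
tokens, so nothing is consumed and "Barack Obama" can still be matched and
consumed as a full match afterwards.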