Rupert Westenthaler created STANBOL-1362:
--------------------------------------------
Summary: FST linking engine should use the matchable span to
calculate dominant tag
Key: STANBOL-1362
URL: https://issues.apache.org/jira/browse/STANBOL-1362
Project: Stanbol
Issue Type: Improvement
Components: Enhancement Engines
Affects Versions: 0.12.0
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
Priority: Minor
Fix For: 1.0.0, 0.12.1
The FST linking engine uses TagClusterReducer#LONGEST_DOMINANT_RIGHT to
select the dominant Tag in a cluster of overlapping Tag suggestions.
The algorithm itself is fine, but the spans it receives as input are not
ideal because they also include non-matchable tokens. This causes unexpected
results especially when linking against DBpedia, where several entities have
labels that include things like pre-/post-positions. A text mentioning
"in {location}" can then get linked to an entity with exactly this name
(e.g. a {book} or {music-album}) without the {location} being suggested.
This is because the matching span for "in {location}" is the longest
dominant right span, so the match for the {location} alone is removed.
To fix this issue one needs to implement a LONGEST_DOMINANT_RIGHT variant that
only considers the span of the enclosed matchable tokens instead of the whole
matching span. With this variant the match for "in {location}" has only
{location} as its matchable span, the same as the match for the {location}
itself, so both the {location} and the other entity matching "in {location}"
are suggested.
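
The proposed variant could be sketched roughly as follows. This is a
self-contained illustration, not the Stanbol or SolrTextTagger code: the Tag
class, its matchStart/matchEnd fields and MatchableSpanReducer are
hypothetical names, and keeping every tag whose matchable span equals the
dominant one is an assumption about how both suggestions survive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tag: the full matching span plus the span of its matchable tokens. */
final class Tag {
    final String label;
    final int start, end;           // full matching span [start, end)
    final int matchStart, matchEnd; // span of the enclosed matchable tokens

    Tag(String label, int start, int end, int matchStart, int matchEnd) {
        this.label = label;
        this.start = start;
        this.end = end;
        this.matchStart = matchStart;
        this.matchEnd = matchEnd;
    }

    int matchLength() {
        return matchEnd - matchStart;
    }

    boolean sameMatchableSpan(Tag o) {
        return matchStart == o.matchStart && matchEnd == o.matchEnd;
    }

    boolean matchableOverlaps(Tag o) {
        return matchStart < o.matchEnd && o.matchStart < matchEnd;
    }
}

/** Sketch of a longest-dominant-right reduction that uses matchable spans only. */
final class MatchableSpanReducer {

    static List<Tag> reduce(List<Tag> cluster) {
        List<Tag> remaining = new ArrayList<>(cluster);
        List<Tag> kept = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Dominant tag: longest matchable span, rightmost end on a tie.
            Tag dominant = remaining.get(0);
            for (Tag t : remaining) {
                boolean longer = t.matchLength() > dominant.matchLength();
                boolean tieButFurtherRight = t.matchLength() == dominant.matchLength()
                        && t.matchEnd > dominant.matchEnd;
                if (longer || tieButFurtherRight) {
                    dominant = t;
                }
            }
            List<Tag> next = new ArrayList<>();
            for (Tag t : remaining) {
                if (t.sameMatchableSpan(dominant)) {
                    kept.add(t);  // assumption: equal matchable spans are all kept
                } else if (!t.matchableOverlaps(dominant)) {
                    next.add(t);  // no overlap: decided in a later round
                }
                // otherwise: overlaps the dominant matchable span and is dropped
            }
            remaining = next;
        }
        return kept;
    }
}
```

For the text "in Paris", a {book} tag covering characters 0-8 and a
{location} tag covering 3-8 both have the matchable span 3-8 once "in" is
treated as non-matchable, so reduce(...) keeps both, while a tag whose
matchable span is only a shorter overlapping part of it is dropped.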
--
This message was sent by Atlassian JIRA
(v6.2#6252)