[ 
https://issues.apache.org/jira/browse/STANBOL-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler updated STANBOL-1362:
-----------------------------------------

    Description: 
The FST linking engine uses TagClusterReducer#LONGEST_DOMINANT_RIGHT to 
select the dominant Tag in a cluster of overlapping Tag suggestions.

While the algorithm itself is fine, the spans used as its input are not ideal 
because they also include non-matchable tokens. This is especially noticeable 
when linking against DBpedia, where several entities have labels that include 
pre- or postpositions. As a result, a text mentioning "in <location>" can get 
linked to an entity with exactly this label (for example a book or a music 
album) without the <location> itself being suggested: the matching span for 
"in <location>" is the LONGEST_DOMINANT_RIGHT, so the shorter match for 
<location> is removed. 

To fix this, a LONGEST_DOMINANT_RIGHT variant is needed that only considers 
the span of the enclosed matchable tokens instead of the whole matching span. 
Doing so uses only <location> as the matchable span, so both the <location> 
and the other entity matching "in <location>" are suggested.

  was:
The FST linking engine uses TagClusterReducer#LONGEST_DOMINANT_RIGHT to 
select the dominant Tag in a cluster of overlapping Tag suggestions.

While the algorithm itself is fine, the spans used as its input are not ideal 
because they also include non-matchable tokens. This is especially noticeable 
when linking against DBpedia, where several entities have labels that include 
pre- or postpositions. As a result, a text mentioning "in {location}" can get 
linked to an entity with exactly this label (for example a {book} or a 
{music-album}) without the {location} itself being suggested: the matching 
span for "in {location}" is the LONGEST_DOMINANT_RIGHT, so the shorter match 
for {location} is removed. 

To fix this, a LONGEST_DOMINANT_RIGHT variant is needed that only considers 
the span of the enclosed matchable tokens instead of the whole matching span. 
Doing so uses only {location} as the matchable span, so both the {location} 
and the other entity matching "in {location}" are suggested.


> FST linking engine should use the matchable span to calculate dominant tag 
> ---------------------------------------------------------------------------
>
>                 Key: STANBOL-1362
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1362
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Enhancement Engines
>    Affects Versions: 0.12.0
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>            Priority: Minor
>             Fix For: 1.0.0, 0.12.1
>
>
> The FST linking engine uses TagClusterReducer#LONGEST_DOMINANT_RIGHT to 
> select the dominant Tag in a cluster of overlapping Tag suggestions.
> While the algorithm itself is fine, the spans used as its input are not 
> ideal because they also include non-matchable tokens. This is especially 
> noticeable when linking against DBpedia, where several entities have labels 
> that include pre- or postpositions. As a result, a text mentioning 
> "in <location>" can get linked to an entity with exactly this label (for 
> example a book or a music album) without the <location> itself being 
> suggested: the matching span for "in <location>" is the 
> LONGEST_DOMINANT_RIGHT, so the shorter match for <location> is removed. 
> To fix this, a LONGEST_DOMINANT_RIGHT variant is needed that only considers 
> the span of the enclosed matchable tokens instead of the whole matching 
> span. Doing so uses only <location> as the matchable span, so both the 
> <location> and the other entity matching "in <location>" are suggested.
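
The proposed variant can be illustrated with a minimal, self-contained sketch.
It is not the Stanbol or TagClusterReducer implementation: the class
MatchableSpanReducer, the Tag model and all field names are hypothetical, and
the rule that tags sharing the dominant tag's matchable span are kept together
is an assumption derived from the description above. The sketch picks the tag
with the longest matchable span (the right-most one on ties), keeps every tag
with that same matchable span, drops the other overlapping tags, and repeats
for whatever is left.

```java
import java.util.ArrayList;
import java.util.List;

final class MatchableSpanReducer {

    /** Hypothetical tag model: full matching span plus the span of its matchable tokens. */
    static final class Tag {
        final String label;
        final int start, end;                   // full matching span (end exclusive)
        final int matchableStart, matchableEnd; // span of enclosed matchable tokens

        Tag(String label, int start, int end, int matchableStart, int matchableEnd) {
            this.label = label;
            this.start = start;
            this.end = end;
            this.matchableStart = matchableStart;
            this.matchableEnd = matchableEnd;
        }

        int matchableLength() {
            return matchableEnd - matchableStart;
        }

        boolean sameMatchableSpan(Tag o) {
            return matchableStart == o.matchableStart && matchableEnd == o.matchableEnd;
        }

        boolean matchableOverlaps(Tag o) {
            return matchableStart < o.matchableEnd && o.matchableStart < matchableEnd;
        }

        @Override
        public String toString() {
            return label;
        }
    }

    /** Reduces one cluster of overlapping tags using matchable spans only. */
    static List<Tag> reduce(List<Tag> cluster) {
        List<Tag> remaining = new ArrayList<>(cluster);
        List<Tag> kept = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Longest matchable span wins; on a tie the right-most one wins.
            Tag dominant = remaining.get(0);
            for (Tag t : remaining) {
                int cmp = Integer.compare(t.matchableLength(), dominant.matchableLength());
                if (cmp > 0 || (cmp == 0 && t.matchableStart > dominant.matchableStart)) {
                    dominant = t;
                }
            }
            final Tag d = dominant;
            // Assumption: tags with the same matchable span are alternative
            // suggestions for the same mention, so all of them are kept.
            for (Tag t : remaining) {
                if (t.sameMatchableSpan(d)) {
                    kept.add(t);
                }
            }
            // Drop the dominant tag and everything overlapping its matchable span.
            remaining.removeIf(t -> t == d || t.matchableOverlaps(d));
        }
        return kept;
    }

    public static void main(String[] args) {
        // Text "I was in Paris": "in" covers offsets 6-8, "Paris" covers 9-14.
        List<Tag> cluster = new ArrayList<>();
        cluster.add(new Tag("in Paris (album)", 6, 14, 9, 14)); // "in" is not matchable
        cluster.add(new Tag("Paris (location)", 9, 14, 9, 14));
        System.out.println(reduce(cluster));
    }
}
```

In this example both tags have the matchable span of "Paris", so both are
returned. Ranking by the full matching span instead would make the longer
"in Paris" tag dominant and remove the location.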



--
This message was sent by Atlassian JIRA
(v6.2#6252)
