[
https://issues.apache.org/jira/browse/STANBOL-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rupert Westenthaler updated STANBOL-1102:
-----------------------------------------
Description:
With the "Max Search Tokens (enhancer.engines.linking.maxSearchTokens)"
configuration, the EntityLinking Engine supports OR queries over multiple
linkable/matchable tokens against the controlled vocabulary (default=2).
This feature ensures that Entities matching longer sections of the text are
ranked higher. This is especially important for bigger vocabularies and/or
tokens that are common within the vocabulary, as the EntityLinking only
considers the top 10 (or 3 * max suggestions) query results.
However, when multiple Tokens are used for a search, there might be
suggestions that match some tokens in the Text but not the currently active
one. Currently those suggestions are taken into account, which can cause
unwanted states like the one described in the following Example:
"Bei einer gmeinsamen Pressekonferenz mit FPÖ-Bundesparteivorsitzenden
Heinz-Christian Strache in Langenlois"
This generates the following queries:
(1) process Token 5: FPÖ
>> searchStrings [FPÖ, Bundesparteivorsitzenden]
<< 0: FPÖ[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://rdf.freebase.com/ns/m.013vy8
(2) process Token 5: Bundesparteivorsitzenden
>> searchStrings [Bundesparteivorsitzenden, Heinz]
<< 0: Heinz[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://rdf.freebase.com/ns/m.0c5y96
(3) process Token 7: Christian
>> searchStrings [Christian, Strache]
<< 0: Heinz-Christian Strache[m=FULL,s=2,c=2(1.0)/3]
score=0.6666666666666666[l=0.6666666666666666,t=1.0] for
http://rdf.freebase.com/ns/m.08lfdk
resulting in a situation where "Heinz" is linked to another Entity, while
"Heinz-Christian Strache", although it completely matches the text, is only
linked with "Christian Strache" AND a lower confidence!
The issue is that search (2), issued for the Token "Bundesparteivorsitzenden",
MUST NOT suggest an Entity that does not match the currently active Token.
Because it does so in the given Example, "Heinz" is already consumed and
cannot be linked with the expected Entity mention "Heinz-Christian Strache".
This issue will add a rule to the Label <-> Text matching that a Label MUST
match the currently active token in the text.
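
The proposed rule can be illustrated with a minimal sketch. This is not the
Stanbol implementation (which is written in Java); the function names, the
simplified tokenizer and the case-insensitive comparison are all assumptions
made for illustration only. It shows how a suggestion whose label does not
contain the currently active token would be rejected.

```python
import re


def tokenize(text):
    """Split on non-word characters (a simplification of real tokenization)."""
    return [tok for tok in re.split(r"[^\w]+", text) if tok]


def matches_active_token(label, active_token):
    """Return True only if one of the label's tokens equals the active token."""
    active = active_token.casefold()
    return any(tok.casefold() == active for tok in tokenize(label))


def filter_suggestions(labels, active_token):
    """Keep only suggestions whose label matches the currently active token."""
    return [label for label in labels if matches_active_token(label, active_token)]


# Search (2) from the example: the active token is "Bundesparteivorsitzenden",
# but the only suggestion matches the look-ahead token "Heinz" instead.
rejected = filter_suggestions(["Heinz"], "Bundesparteivorsitzenden")

# Hypothetical later search in which "Heinz" itself is the active token.
accepted = filter_suggestions(["Heinz", "Heinz-Christian Strache"], "Heinz")
```

Under this rule the suggestion from search (2) is dropped, so "Heinz" would
not be consumed before a search in which it is the active token.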
was:
With the "Max Search Tokens (enhancer.engines.linking.maxSearchTokens)"
configuration, the EntityLinking Engine supports OR queries over multiple
linkable/matchable tokens against the controlled vocabulary (default=2).
This feature ensures that Entities matching longer sections of the text are
ranked higher. This is especially important for bigger vocabularies and/or
tokens that are common within the vocabulary, as the EntityLinking only
considers the top 10 (or 3 * max suggestions) query results.
However, in cases where no Entity matches several tokens of the search, this
feature currently causes an unwanted side effect: it may match single tokens
that are not the currently active one.
E.g. the text section "Bei einer gmeinsamen Pressekonferenz mit
FPÖ-Bundesparteivorsitzenden Heinz-Christian Strache in Langenlois" generates
the following queries:
(1) process Token 5: FPÖ
>> searchStrings [FPÖ, Bundesparteivorsitzenden]
<< 0: FPÖ[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://rdf.freebase.com/ns/m.013vy8
(2) process Token 5: Bundesparteivorsitzenden
>> searchStrings [Bundesparteivorsitzenden, Heinz]
<< 0: Heinz[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://rdf.freebase.com/ns/m.0c5y96
(3) process Token 7: Christian
>> searchStrings [Christian, Strache]
<< 0: Heinz-Christian Strache[m=FULL,s=2,c=2(1.0)/3]
score=0.6666666666666666[l=0.6666666666666666,t=1.0] for
http://rdf.freebase.com/ns/m.08lfdk
resulting in a situation where "Heinz" is linked to another Entity, while
"Heinz-Christian Strache", although it completely matches the text, is only
linked with "Christian Strache" AND a lower confidence!
The issue is that search (2), issued for the Token "Bundesparteivorsitzenden",
MUST NOT suggest an Entity that does not match the currently active Token.
Because it does so in the given Example, "Heinz" is already consumed and
cannot be linked with the expected Entity mention "Heinz-Christian Strache".
This issue will add a rule to EntityLinking that the currently active Token
needs to be included in suggestions.
> EntityLinking MUST only accept Suggestions for the current active Token
> -----------------------------------------------------------------------
>
> Key: STANBOL-1102
> URL: https://issues.apache.org/jira/browse/STANBOL-1102
> Project: Stanbol
> Issue Type: Sub-task
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler