Rupert Westenthaler created STANBOL-1110:
--------------------------------------------

             Summary: Use Term Proximity for Searching Entities in the EntityhubLinkingEngine
                 Key: STANBOL-1110
                 URL: https://issues.apache.org/jira/browse/STANBOL-1110
             Project: Stanbol
          Issue Type: Improvement
          Components: Enhancement Engines
    Affects Versions: enhancement-engines-0.10.0
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


The ranking of the results of the EntityLinkingEngine is distorted because 
some Entities have matching labels in both the language of the text and the 
fallback language, while others match in only one of the two. As background:

The EntityLinkingEngine performs queries like

    {lang1}:"{term1}" OR {lang1}:"{term2}" OR {lang2}:"{term1}" OR {lang2}:"{term2}"

when linking Entities, where {lang1} is the language detected for the document 
and {lang2} is the default mapping language.
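
To make the query shape above concrete, here is a minimal sketch in Java. The 
class and method names (LabelQuerySketch, buildQuery) are hypothetical and not 
part of the Stanbol API; the sketch only reproduces the clause pattern shown 
above and does not escape special characters inside terms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Stanbol API.
final class LabelQuerySketch {

    // Builds one {lang}:"{term}" clause per language/term pair and joins
    // the clauses with OR, languages in the outer loop, terms in the inner.
    // Assumption: terms contain no quotes or other characters needing escaping.
    static String buildQuery(List<String> languages, List<String> terms) {
        List<String> clauses = new ArrayList<>();
        for (String lang : languages) {
            for (String term : terms) {
                clauses.add(lang + ":\"" + term + "\"");
            }
        }
        return String.join(" OR ", clauses);
    }
}
```

With two languages and two terms this yields the four clauses discussed in the 
rest of this issue.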

When such queries are executed on the Entityhub-based EntitySearcher 
implementations of the EntityhubLinkingEngine, Entities matching only one of 
the query terms are sometimes ranked ahead of Entities matching both terms.

The reason is that there are two ways in which two of the four query clauses 
can match:

 (a) both {term1} and {term2} do match in the same language
 (b) a single term matches in {lang1} and {lang2}

While (a) is the match users expect, (b) is not unlikely, especially if the 
entity matching via (a) is not very well known and lacks translations of its 
labels into many languages, while {term1} and/or {term2} appear in better-known 
entities that do have such translations. This happens most often with the 
given names of persons.

For performance reasons, the EntityLinkingEngine processes only the first few 
results (by default 2*maxSuggestions, but at least 10). The unintended ranking 
can therefore cause Entities not to be linked at all.
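
The default limit described above can be written as a one-line calculation. 
This is a sketch only: the names ResultLimitSketch and resultLimit are 
hypothetical, and the actual engine code may compute the value differently.

```java
// Hypothetical helper for illustration only; not part of the Stanbol API.
final class ResultLimitSketch {

    // Number of search results to process: twice the configured number of
    // suggestions, but never fewer than 10.
    static int resultLimit(int maxSuggestions) {
        return Math.max(2 * maxSuggestions, 10);
    }
}
```

For example, with maxSuggestions = 3 only the first 10 results are processed, 
so an Entity ranked 11th or lower is never considered for linking.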

The new Proximity Ranking Feature (STANBOL-1105) can be used to solve this, as 
it ensures that Entities matching both terms in the same language (and 
therefore in the same label) will be ranked above those that match only a 
single term in two different languages.
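
The intended effect on the ranking can be illustrated with a small, 
self-contained sketch. This is not the STANBOL-1105 implementation; it is a 
simplified stand-in that scores each entity by the largest number of query 
terms found together in a single label. All names (SameLabelRankingSketch, 
bestSingleLabelMatches, rank) and the whitespace tokenization are assumptions 
made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the STANBOL-1105 implementation.
final class SameLabelRankingSketch {

    // Largest number of query terms that occur together in one label.
    // labelsByLang maps a language code to the labels in that language.
    static int bestSingleLabelMatches(Map<String, List<String>> labelsByLang,
                                      List<String> terms) {
        int best = 0;
        for (List<String> labels : labelsByLang.values()) {
            for (String label : labels) {
                // Assumption: simple case-insensitive whitespace tokenization.
                List<String> tokens =
                        Arrays.asList(label.toLowerCase(Locale.ROOT).split("\\s+"));
                int matched = 0;
                for (String term : terms) {
                    if (tokens.contains(term.toLowerCase(Locale.ROOT))) {
                        matched++;
                    }
                }
                best = Math.max(best, matched);
            }
        }
        return best;
    }

    // Orders entity ids so that entities matching more terms within a single
    // label come first; ties are broken by id to keep the order deterministic.
    static List<String> rank(Map<String, Map<String, List<String>>> entities,
                             List<String> terms) {
        List<String> ids = new ArrayList<>(entities.keySet());
        ids.sort(Comparator
                .comparingInt((String id) -> bestSingleLabelMatches(entities.get(id), terms))
                .reversed()
                .thenComparing(Comparator.naturalOrder()));
        return ids;
    }
}
```

With the terms "John" and "Smith", an entity whose only label is "John Smith" 
in one language scores 2, while an entity labelled "John Lennon" in two 
languages scores 1, so case (a) is ranked ahead of case (b).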

This issue will enable the use of this feature for the EntityhubLinkingEngine.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
