[ 
https://issues.apache.org/jira/browse/STANBOL-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler updated STANBOL-1104:
-----------------------------------------

    Summary: Improve ranking for multi term OR queries over the SolrYard  (was: 
Use Phrase queries for OR query terms in the SolrYard)
    
> Improve ranking for multi term OR queries over the SolrYard
> -----------------------------------------------------------
>
>                 Key: STANBOL-1104
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1104
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Entityhub
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Tests of EntityLinking against big vocabularies (e.g. Freebase with about
> 40 million entities) have shown that the Solr queries currently used for
> multi-token OR queries do not always give the expected ranking of results,
> for the following reasons:
> ReferencedSites use entity rankings (implemented as index-time document
> boosts), and those rankings affect the ranking of query results. On the
> positive side, they ensure that a query for Paris returns Paris, France
> before Paris, Texas. On the negative side, for a query with two tokens
> (e.g. two given names), other entities that match only one of those terms
> (e.g. a very famous person with one of the two requested given names) may
> be ranked before lower-ranked entities that match both terms.
> This is even more likely for terms that are very common in the index, as
> normalization reduces the boost for entities with such a term, so the
> document boost has an even higher impact.
> The described behavior is especially a problem for the EntityLinkingEngine,
> as it uses exactly this kind of "{term1} OR {term2}" query to look up
> entities.
> The use of "Term Proximity" as suggested by [1] is clearly the best option
> to work around the stated problem: (1) entities that match only one of the
> query terms get no boost from this part of the query; (2) even for
> entities that match several or all terms, the ranking improves because the
> distance between the terms within the text is considered when calculating it.
> However, this also means that queries for multiple OR-connected terms
> become more complex and need additional time to process. The impact of this
> additional complexity needs to be investigated further.
> Other possible workarounds:
> * Disable the use of index-time document boosts. However, this would have a
> negative impact on everyday searches (e.g. for Paris) and is therefore not
> an option in most scenarios.
> * Increase the number of entities selected for the EntityLinkingEngine.
> Currently max(10,2*maxSuggestion) entities are retrieved. Increasing this
> value would make the engine more robust against unexpected rankings. However,
> (1) it does not solve the problem but only works around it; (2) some tests
> have shown that even with the value increased to 50 the expected result is
> not included (using the freebase.com index as dataset).
> * Explicitly add a "{term1} {term2}" phrase query in the
> EntityLinkingEngine to queries for "{term1}" OR "{term2}". However, this
> would only boost entities where {term1} and {term2} occur in exactly that
> order. Entities containing "{term2} {term1}" or "{term1} {other} {term2}"
> would not get any boost.
> * Use a high query-time boost for multi-term OR queries as suggested by
> [2]. This would increase the boost given to entities containing both
> {term1} and {term2} and therefore reduce the influence of the index-time
> document boost used to represent the entity ranking. The advantage is that
> this has no performance implications (as it only influences the ranking
> computation and does not make the query more complex).
> So if the performance overhead allows the use of phrase queries, this should
> be enabled for the Entityhub SolrYard. If it causes a considerable
> performance overhead, it should become a new option that can be
> activated/deactivated.
> [1] http://wiki.apache.org/solr/SolrRelevancyCookbook#Term_Proximity
> [2] http://wiki.apache.org/solr/SolrRelevancyCookbook#Ranking_Terms
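
The ranking problem described in the issue can be illustrated with a toy calculation. This is only a sketch: the scoring function below is a deliberately simplified stand-in (document boost multiplied by the number of matched terms) and is not Lucene's or Solr's actual formula. The boost values, the entity names, and the `all_terms_bonus` parameter are assumptions made up for the illustration.

```python
def toy_score(doc_boost, matched_terms, total_terms, all_terms_bonus=0.0):
    """Toy relevance score (NOT the real Lucene formula).

    Each matched term contributes 1.0; an optional bonus is added only when
    every query term matched, mimicking a proximity/phrase clause. The
    index-time document boost multiplies the whole score.
    """
    score = float(matched_terms)
    if matched_terms == total_terms:
        score += all_terms_bonus
    return doc_boost * score


def rank(all_terms_bonus):
    """Return the hypothetical entity names ordered by descending toy score."""
    scores = {
        # Very famous entity (high document boost) matching one of two terms.
        "famous_one_term": toy_score(5.0, 1, 2, all_terms_bonus),
        # Little-known entity (low document boost) matching both terms.
        "obscure_both_terms": toy_score(1.0, 2, 2, all_terms_bonus),
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(rank(0.0))  # the document boost dominates
    print(rank(4.0))  # the bonus for matching all terms dominates
```

Under these made-up numbers, the famous entity wins without a bonus, while a sufficiently large bonus for matching all terms lets the entity that matches both terms rank first. How large the bonus has to be in a real index depends on the actual boosts and scoring.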

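The "Term Proximity" idea from [1] can be sketched as a query-string builder: the usual OR clauses plus one sloppy phrase clause over all terms, optionally with a query-time boost. This is a minimal sketch, not the SolrYard implementation. The function names, the `label` field, and the slop and boost values are assumptions. It relies only on standard Lucene query syntax (`"a b"~N` for proximity, `^N` for boosting), and escaping is limited to what is needed inside a quoted phrase.

```python
def escape_term(term):
    """Escape backslashes and double quotes so a term is safe inside quotes."""
    return term.replace("\\", "\\\\").replace('"', '\\"')


def build_or_query(field, terms, slop=None, phrase_boost=None):
    """Build a Lucene/Solr query string for OR-connected terms.

    If `slop` is given and there is more than one term, a sloppy phrase
    clause over all terms is appended. Only documents containing all terms
    within the given distance match that clause, so only they receive its
    score contribution.
    """
    clauses = ['{}:"{}"'.format(field, escape_term(t)) for t in terms]
    query = " OR ".join(clauses)
    if slop is not None and len(terms) > 1:
        phrase = " ".join(escape_term(t) for t in terms)
        proximity = '{}:"{}"~{}'.format(field, phrase, slop)
        if phrase_boost is not None:
            proximity += "^{}".format(phrase_boost)
        query = "({}) OR {}".format(query, proximity)
    return query


if __name__ == "__main__":
    # Hypothetical field name and parameter values, for illustration only.
    print(build_or_query("label", ["John", "Paul"], slop=10, phrase_boost=5))
```

With a large enough slop, the proximity clause also matches entities where the terms appear in a different order or with other words in between, which the plain phrase workaround listed in the issue would miss.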