[jira] [Resolved] (STANBOL-1104) Improve ranking for multi term OR queries over the SolrYard

Rupert Westenthaler (JIRA) Thu, 13 Jun 2013 05:42:02 -0700

     [ 
https://issues.apache.org/jira/browse/STANBOL-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rupert Westenthaler resolved STANBOL-1104.
------------------------------------------

    Resolution: Fixed

Implemented proximity-based ranking by using phrase queries (for 
TextConstraints) and constraint boosts with http://svn.apache.org/r1492591.
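
As an illustration of the general idea (not the code committed in r1492591), a multi-term OR query can be combined with a boosted proximity phrase clause using standard Lucene/Solr query syntax. The field name `label`, the function names, and the default slop and boost values below are assumptions chosen for the sketch.

```python
# Characters with special meaning in the Lucene/Solr standard query parser.
LUCENE_SPECIAL = '+-&|!(){}[]^"~*?:\\/'


def escape_term(term):
    """Backslash-escape query parser special characters in a single term."""
    return "".join("\\" + c if c in LUCENE_SPECIAL else c for c in term)


def build_or_query_with_proximity(field, terms, slop=5, phrase_boost=10.0):
    """Build '<field>:(t1 OR t2 ...)' plus a boosted sloppy phrase clause.

    The phrase clause only matches documents that contain all terms within
    `slop` position moves of each other, so documents matching a single
    term get no score contribution from it.
    """
    escaped = [escape_term(t) for t in terms]
    or_part = f"{field}:({' OR '.join(escaped)})"
    if len(escaped) < 2:
        # A proximity clause needs at least two terms.
        return or_part
    phrase = " ".join(escaped)
    return f'{or_part} OR {field}:"{phrase}"~{slop}^{phrase_boost}'


print(build_or_query_with_proximity("label", ["Anna", "Maria"]))
# label:(Anna OR Maria) OR label:"Anna Maria"~5^10.0
```

With a slop of 2 or more, Lucene's sloppy phrase matching also accepts the two terms in reversed order, and a larger slop tolerates intervening tokens, so the clause is not limited to labels with the exact token order.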
                
> Improve ranking for multi term OR queries over the SolrYard
> -----------------------------------------------------------
>
>                 Key: STANBOL-1104
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1104
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Entityhub
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Tests for EntityLinking against big vocabularies (e.g. Freebase with about 40 
> million entities) have shown that the Solr queries currently used for 
> multi-token OR queries do not always give the expected ranking of the 
> results, for the following reasons:
> ReferencedSites use entity rankings (implemented as index-time document 
> boosts). Those rankings have an impact on the ranking of query results. 
> On the positive side, they ensure that a query for Paris returns 
> Paris, France before Paris, Texas. On the negative side, for a query with 
> two tokens (e.g. two given names) it can happen that entities matching only 
> one of those terms (e.g. a very famous person with one of the two requested 
> given names) are ranked before lower-ranked entities that match both 
> terms.
> This is even more likely for terms that are very common in the index, as 
> normalization reduces the boost for entities with such a term, which gives 
> the document boost an even higher impact.
> The described behavior is especially a problem for the EntityLinkingEngine, 
> as it uses exactly this kind of "{term1} OR {term2}" query to look up 
> entities. 
> Possible Solutions include:
> * disable the use of index time document boosts: However this would have a 
> negative impact on everyday searches (e.g. for Paris) and is therefore not 
> an option within most scenarios.
> * increase the number of selected entities for the EntityLinkingEngine: 
> currently max(10,2*maxSuggestion) entities are retrieved. Increasing this 
> value would make the engine more resistant to unexpected rankings. However 
> (1) it does not solve the problem, only works around it; (2) some tests have 
> shown that even increasing the value to 50 does not include the expected result 
> (using the freebase.com index as dataset).
> * explicitly adding a "{token1} {token2}" query term in the 
> EntityLinkingEngine to queries for "{term1}" OR "{term2}". However this would 
> only boost entities where {token1} and {token2} appear in exactly that order. 
> Entities containing "{token2} {token1}" or "{token1} {other} {token2}" would 
> not get any boost. So this solution will only improve rankings for those 
> cases where the label would also match an AND-connected query. 
> * the use of "Term Proximity" as suggested by [1]: This ensures that (1) 
> entities that match only one of the parsed terms get no boost from 
> this part of the query, (2) even for entities that match several/all terms 
> the ranking is improved, as the distance within the text is 
> considered when calculating the ranking. As phrase queries are more 
> expensive to answer, this is expected to have an impact on 
> performance.
> * Using a high query time boost for multi term OR queries as suggested by 
> [2]. This would allow increasing the boost given to entities containing 
> {token1} and {token2} and therefore reduce the influence of the index time 
> document boost used to represent the entity ranking. The advantage is that 
> this will not have any performance implications (as it only influences the 
> ranking computation and does not make the query more complex). 
> [1] http://wiki.apache.org/solr/SolrRelevancyCookbook#Term_Proximity
> [2] http://wiki.apache.org/solr/SolrRelevancyCookbook#Ranking_Terms
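
The ranking problem and the effect of a query-time boost described above can be illustrated with a deliberately simplified toy model. It is not Lucene's actual scoring formula: the score is just the document boost multiplied by the number of matched terms, plus an optional bonus that applies only when every term matches. The entity labels and boost values are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    label: str
    doc_boost: float  # stands in for the index-time document boost (entity ranking)


def toy_score(entity, terms, all_terms_boost=0.0):
    """Toy score: doc_boost * (matched term count + bonus if all terms match)."""
    tokens = entity.label.lower().split()
    matched = [t for t in terms if t.lower() in tokens]
    score = float(len(matched))
    if all_terms_boost and len(matched) == len(terms):
        # Models a boosted clause that only matches when every term is present.
        score += all_terms_boost
    return entity.doc_boost * score


def rank(entities, terms, all_terms_boost=0.0):
    """Return entities sorted by descending toy score."""
    return sorted(entities,
                  key=lambda e: toy_score(e, terms, all_terms_boost),
                  reverse=True)


entities = [
    Entity("Anna Famous", doc_boost=5.0),         # high ranking, matches one term
    Entity("Anna Maria Example", doc_boost=1.0),  # low ranking, matches both terms
]
terms = ["anna", "maria"]

print([e.label for e in rank(entities, terms)])
# ['Anna Famous', 'Anna Maria Example']   (5.0 vs 2.0)
print([e.label for e in rank(entities, terms, all_terms_boost=10.0)])
# ['Anna Maria Example', 'Anna Famous']   (12.0 vs 5.0)
```

In this model the document boost alone lets the single-term match win, while a sufficiently high bonus for matching all terms reverses the order without removing the entity ranking from the score.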

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
