[
https://issues.apache.org/jira/browse/STANBOL-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rupert Westenthaler resolved STANBOL-1104.
------------------------------------------
Resolution: Fixed
Implemented proximity-based ranking by using phrase queries (for
TextConstraints) and constraint boosts with http://svn.apache.org/r1492591.
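
For illustration only, a minimal sketch of what such a query could look like
in Lucene/Solr query syntax: the plain OR query is kept, and a boosted sloppy
phrase clause over the same terms is added. This is not the code from the
commit above. The field name "label", the slop of 2 and the boost of 10 are
assumptions chosen for the example, and build_query is a hypothetical helper.

```python
def quote(term):
    """Wrap a term in double quotes, escaping backslashes and quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_query(field, terms, slop=2, phrase_boost=10.0):
    """Build '(t1 OR t2 ...) OR "t1 t2 ..."~slop^boost' for one field.

    The OR part decides which entities are selected; the phrase part only
    adds score for entities that contain the terms close to each other.
    """
    if not terms:
        raise ValueError("at least one term is required")
    or_part = " OR ".join(f"{field}:{quote(t)}" for t in terms)
    if len(terms) < 2:
        # A single term has no proximity to boost.
        return or_part
    phrase = quote(" ".join(terms))
    return f"({or_part}) OR {field}:{phrase}~{slop}^{phrase_boost:g}"


if __name__ == "__main__":
    print(build_query("label", ["paris", "hilton"]))
```

Because the phrase clause is OR-connected, it does not change which entities
match; it only raises the score of those whose label contains the terms near
each other.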
> Improve ranking for multi term OR queries over the SolrYard
> -----------------------------------------------------------
>
> Key: STANBOL-1104
> URL: https://issues.apache.org/jira/browse/STANBOL-1104
> Project: Stanbol
> Issue Type: Improvement
> Components: Entityhub
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> Tests of EntityLinking against big vocabularies (e.g. Freebase with about 40
> million entities) have shown that the Solr queries currently used for
> multi-token OR queries do not always give the expected ranking of results,
> for the following reasons:
> ReferencedSites use entity rankings (implemented as index-time document
> boosts). Those rankings have an impact on the ranking of query results.
> On the positive side, they ensure that a query for Paris returns
> Paris, France before Paris, Texas. On the negative side, for a query with two
> tokens (e.g. two given names), other entities that contain only one of
> those terms (e.g. a very famous person with one of the two requested given
> names) may be ranked before lower-ranked entities that match both
> terms.
> This is even more likely for terms that are very common in the index, as
> normalization will reduce the boost for entities with such a term, which
> gives the document boost an even higher impact.
> The described behavior is especially a problem for the EntityLinkingEngine, as
> it uses exactly this kind of "{term1} OR {term2}" query to look up
> entities.
> Possible Solutions include:
> * disable the use of index-time document boosts: However, this would have a
> negative impact on everyday searches (e.g. for Paris) and is therefore not
> an option in most scenarios.
> * increase the number of selected entities for the EntityLinkingEngine:
> currently max(10,2*maxSuggestion) entities are retrieved. Increasing this
> value would make the engine more resistant to unexpected rankings. However,
> (1) it only works around the problem rather than solving it; (2) some tests
> have shown that even with the value increased to 50 the expected result is
> not included (using the freebase.com index as dataset).
> * explicitly adding a "{token1} {token2}" query term in the
> EntityLinkingEngine to queries for "{term1}" OR "{term2}". However, this would
> only boost entities where {token1} and {token2} appear in exactly that order.
> Entities containing "{token2} {token1}" or "{token1} {other} {token2}" would
> not get any boost. So this solution would only improve rankings in those
> cases where the label would also match an AND-connected query.
> * the use of "Term Proximity" as suggested by [1]: This ensures that (1)
> entities that match only one of the given terms get no boost from
> this part of the query, and (2) even for entities that match several/all
> terms the ranking is improved, as the distance within the text is
> considered when calculating the ranking. As phrase queries are more
> expensive to answer, this is expected to have an impact on
> performance.
> * using a high query-time boost for multi-term OR queries as suggested by
> [2]. This would make it possible to increase the boost given to entities
> containing {token1} and {token2} and therefore reduce the influence of the
> index-time document boost used to represent the entity ranking. The
> advantage is that this would not have any performance implications (as it
> only influences the ranking computation and does not make the query more
> complex).
> [1] http://wiki.apache.org/solr/SolrRelevancyCookbook#Term_Proximity
> [2] http://wiki.apache.org/solr/SolrRelevancyCookbook#Ranking_Terms
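
The difference between the exact "{token1} {token2}" option and the term
proximity option described in the issue above can be sketched with a small
model. It computes the smallest slop at which a two-term phrase query would
match a tokenized label, on the assumption that the slop counts how far the
second term's position deviates from directly following the first. This is a
simplified illustration for two distinct terms, not Lucene code; min_slop and
the sample labels are hypothetical.

```python
def min_slop(label_tokens, token1, token2):
    """Smallest slop at which the phrase "token1 token2" matches the label.

    Returns None if one of the two tokens is missing, i.e. the phrase part
    of the query can never match and contributes no boost.
    Assumes token1 and token2 are different tokens.
    """
    pos1 = [i for i, t in enumerate(label_tokens) if t == token1]
    pos2 = [i for i, t in enumerate(label_tokens) if t == token2]
    if not pos1 or not pos2:
        return None
    # An exact phrase has token2 directly after token1 (p2 - p1 == 1);
    # every position of deviation from that costs one unit of slop.
    return min(abs(p2 - p1 - 1) for p1 in pos1 for p2 in pos2)


if __name__ == "__main__":
    for label in (["john", "paul"], ["paul", "john"],
                  ["john", "x", "paul"], ["john"]):
        print(label, min_slop(label, "john", "paul"))
```

Under this model only the first label matches an exact phrase (slop 0). The
reversed label needs a slop of 2, the label with one token in between needs a
slop of 1, and the label with a single matching token never matches the
phrase part at all.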