[
https://issues.apache.org/jira/browse/STANBOL-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rupert Westenthaler updated STANBOL-1104:
-----------------------------------------
Summary: Improve ranking for multi term OR queries over the SolrYard (was:
Use Phrase queries for OR query terms in the SolrYard)
> Improve ranking for multi term OR queries over the SolrYard
> -----------------------------------------------------------
>
> Key: STANBOL-1104
> URL: https://issues.apache.org/jira/browse/STANBOL-1104
> Project: Stanbol
> Issue Type: Improvement
> Components: Entityhub
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> Tests for EntityLinking against big vocabularies (e.g. Freebase with about 40
> million entities) have shown that the currently used Solr queries for
> multi-token OR queries do not always give the expected ranking of the results,
> for the following reasons:
> ReferencedSites use entity rankings (implemented as index time document
> boosts). Those rankings have an impact on the ranking of query results.
> On the positive side, they ensure that a query for Paris returns
> Paris, France before Paris, Texas. On the negative side, for a query with two
> tokens (e.g. two given names) other entities that match only one of
> those terms (e.g. a very famous person with one of the two requested given
> names) may be ranked before entities with a lower ranking that match both
> terms.
> This is even more likely for terms that are very common in the index, as
> normalization reduces the boost entities get from matching such a term,
> which gives the document boost an even higher impact.
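The effect described above can be illustrated with a deliberately simplified toy model. This is not Lucene's actual scoring formula; the class name, the method names, and the multiplicative model (document boost times number of matched terms) are assumptions made only for illustration.

```java
/** Toy model (assumption): score = index time document boost * matched query terms. */
final class BoostRankingSketch {

    /** Returns the toy score for an entity with the given boost and match count. */
    static double score(double docBoost, int matchedTerms) {
        return docBoost * matchedTerms;
    }

    /** True if entity A would be ranked before entity B under the toy model. */
    static boolean ranksBefore(double boostA, int matchedA, double boostB, int matchedB) {
        return score(boostA, matchedA) > score(boostB, matchedB);
    }

    public static void main(String[] args) {
        // A very famous entity (boost 10) matching one of two terms ...
        double famous = score(10.0, 1);
        // ... versus a less prominent entity (boost 1) matching both terms.
        double exact = score(1.0, 2);
        System.out.println("famous=" + famous + " exact=" + exact);
    }
}
```

Under these assumed numbers the entity matching only one term wins, which is the unexpected ranking the issue describes.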
> The described behavior is especially a problem for the EntityLinkingEngine, as
> it uses exactly this kind of "{term1} OR {term2}" query to look up
> Entities.
> The use of "Term Proximity" as suggested by [1] is clearly the best option
> to work around the stated problem: (1) Entities that match only one of the
> terms will get no boost from this part of the query; (2) even for
> entities that match several/all terms the ranking will improve, as the
> distance within the text will be considered when calculating the ranking.
> However, this also means that queries for multiple OR
> connected terms will be more complex and need some additional time to
> process. The impact of this additional complexity needs to be
> investigated further.
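A minimal sketch of how such a query string could be assembled, using the standard Lucene/Solr sloppy phrase syntax ("..."~N). The class name, the field name passed in, and the slop value are hypothetical; this is not the SolrYard implementation.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Builds "(f:"t1" OR f:"t2") OR f:"t1 t2"~slop" style query strings (illustrative only). */
final class ProximityQuerySketch {

    /** Wraps a term in double quotes, escaping backslashes and quotes inside it. */
    static String quote(String term) {
        return "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /**
     * @param field hypothetical name of the indexed label field
     * @param terms the OR connected query terms
     * @param slop  maximum positional distance allowed by the phrase clause
     */
    static String build(String field, List<String> terms, int slop) {
        String or = terms.stream()
                .map(t -> field + ":" + quote(t))
                .collect(Collectors.joining(" OR "));
        if (terms.size() < 2) {
            return or; // a proximity clause makes no sense for a single term
        }
        // Optional sloppy phrase clause: only documents containing all terms
        // close to each other receive the additional score from this part.
        String phrase = field + ":" + quote(String.join(" ", terms)) + "~" + slop;
        return "(" + or + ") OR " + phrase;
    }
}
```

For example, `ProximityQuerySketch.build("label", List.of("John", "Paul"), 5)` yields `(label:"John" OR label:"Paul") OR label:"John Paul"~5`. With a slop of 2 or more, a sloppy phrase also matches two adjacent terms in reversed order.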
> Other possible workarounds:
> * disable the use of index time document boosts: However this would have a
> negative impact on everyday searches (e.g. for Paris) and is therefore not
> an option in most scenarios.
> * increase the number of selected entities for the EntityLinkingEngine:
> currently max(10,2*maxSuggestion) entities are retrieved. Increasing this
> value would make the engine more resistant to unexpected rankings. However
> (1) this only works around the problem rather than solving it; (2) some tests
> have shown that even increasing the value to 50 does not include the expected
> result (using the freebase.com index as dataset).
> * explicitly adding a "{token1} {token2}" query term in the
> EntityLinkingEngine to queries for "{token1}" OR "{token2}". However this would
> only boost entities where {token1} and {token2} occur in exactly that order.
> Entities containing "{token2} {token1}" or "{token1} {other} {token2}" would
> not get any boost.
> * using a high query time boost for multi term OR queries as suggested by
> [2]. This would make it possible to increase the boost given to entities
> containing {token1} and {token2} and therefore reduce the influence of the
> index time document boost used to represent the entity ranking. The advantage
> is that this will not have any performance implications (as it only influences
> the ranking computation and does not make the query more complex).
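One possible way to express the query time boost from the last workaround is sketched below, using the Lucene/Solr "^" boost operator on a conjunction of the tokens. This form is an assumption: the issue does not state how the boost from [2] would be applied, and the class name, field name, and boost value are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Builds "(f:"t1" OR f:"t2") OR (f:"t1" AND f:"t2")^boost" (illustrative only). */
final class BoostedOrQuerySketch {

    /** Wraps a term in double quotes, escaping backslashes and quotes inside it. */
    static String quote(String term) {
        return "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Joins field:"term" clauses with the given boolean operator. */
    static String join(String field, List<String> terms, String operator) {
        return terms.stream()
                .map(t -> field + ":" + quote(t))
                .collect(Collectors.joining(" " + operator + " "));
    }

    static String build(String field, List<String> terms, int boost) {
        String or = join(field, terms, "OR");
        if (terms.size() < 2) {
            return or; // nothing to boost for a single term
        }
        // Assumed form: documents matching ALL terms get a boosted extra clause,
        // in any order and at any distance within the field.
        return "(" + or + ") OR (" + join(field, terms, "AND") + ")^" + boost;
    }
}
```

Unlike the explicit "{token1} {token2}" phrase, the boosted conjunction in this sketch does not depend on token order, but it does add a clause to the query, so it may not match the "no additional complexity" property claimed for the approach in [2].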
> So if the performance overhead allows the use of phrase queries, this should be
> enabled for the Entityhub SolrYard. In case it has a considerable
> performance overhead, it should become a new option that can be
> activated/deactivated.
> [1] http://wiki.apache.org/solr/SolrRelevancyCookbook#Term_Proximity
> [2] http://wiki.apache.org/solr/SolrRelevancyCookbook#Ranking_Terms