[
https://issues.apache.org/jira/browse/STANBOL-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rupert Westenthaler updated STANBOL-1104:
-----------------------------------------
Description:
Tests for EntityLinking against big vocabularies (e.g. Freebase with about 40
million entities) have shown that the Solr queries currently used for
multi-token OR queries do not always rank the results as expected, for the
following reasons:
ReferencedSites use entity rankings (implemented as index-time document
boosts). Those rankings influence the ranking of query results. On
the positive side, they ensure that a query for Paris returns
Paris, France before Paris, Texas. On the negative side, for a query with two
tokens (e.g. two given names), entities that match only one of those terms
(e.g. a very famous person with one of the two requested given names) may be
ranked before entities with a lower ranking that match both terms.
This is even more likely for terms that are very common in the index, as
normalization will reduce the boost for entities with such a term, so that
the document boost has an even higher impact.
The described behavior is especially a problem for the EntityLinkingEngine, as
it uses exactly this kind of "{term1} OR {term2}" query to look up entities.
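The effect described above can be illustrated with a deliberately simplified
toy model. This is only a sketch: it is not the Lucene/Solr scoring formula,
and the function name, the boosts and the term weights below are made-up
values chosen for illustration.

```python
def toy_score(doc_boost, term_weights, matched_terms):
    """Toy stand-in for a relevance score: document boost times the
    summed weights of the query terms the document matches.
    NOT the real Lucene formula, just enough to show the effect."""
    return doc_boost * sum(term_weights[t] for t in matched_terms)

# Hypothetical weights: both given names are common in the index,
# so each one contributes only a small amount to the score.
term_weights = {"anna": 0.4, "maria": 0.5}

# A very famous entity (high index-time boost) that matches only one term.
famous_single_match = toy_score(5.0, term_weights, ["maria"])

# A less prominent entity (low boost) that matches both terms.
obscure_double_match = toy_score(1.0, term_weights, ["anna", "maria"])

# The single-term match outranks the entity that matches both terms.
ranked_wrong = famous_single_match > obscure_double_match
```

In this toy setting the lower the term weights are (i.e. the more common the
terms), the more the document boost dominates the final order.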
Possible Solutions include:
* disable the use of index-time document boosts: However, this would have a
negative impact on everyday searches (e.g. for Paris) and is therefore not an
option in most scenarios.
* increase the number of selected entities for the EntityLinkingEngine:
currently max(10,2*maxSuggestion) entities are retrieved. Increasing this value
would make the engine more resistant to unexpected rankings. However, (1) it
only works around the problem rather than solving it; (2) some tests have shown
that even with the value increased to 50 the expected result is not included
(using the freebase.com index as dataset).
* explicitly add a "{token1} {token2}" query term in the EntityLinkingEngine
to queries for "{term1}" OR "{term2}". However, this would only boost entities
where {token1} and {token2} appear in exactly that order. Entities containing
"{token2} {token1}" or "{token1} {other} {token2}" would not get any boost. So
this solution only improves rankings in those cases where the label would
also match an AND-connected query.
* use "Term Proximity" as suggested by [1]: This ensures that (1)
entities that match only one of the query terms get no boost from this
part of the query, and (2) even for entities that match several/all terms the
ranking improves, as the distance between the terms within the text is
considered when calculating the ranking. As phrase queries are more expensive
to answer, this is expected to have an impact on performance.
* use a high query-time boost for multi-term OR queries as suggested by [2].
This would increase the boost given to entities containing both {token1}
and {token2} and therefore reduce the influence of the index-time document
boost used to represent the entity ranking. The advantage is that this will not
have any performance implications (as it only influences the ranking
computation and does not make the query more complex).
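To make the query-based options above more concrete, here is a minimal sketch
of the query strings involved, using standard Lucene query syntax (quoted
phrase, "~N" for slop, "^N" for a query-time boost). The helper names, the
field name "label", and the slop and boost values are assumptions made for
illustration; they are not taken from the SolrYard or the EntityLinkingEngine.
Tokens are assumed to be plain words that need no escaping.

```python
def retrieval_limit(max_suggestions):
    """Number of entities currently retrieved, as stated in the issue:
    max(10, 2*maxSuggestion)."""
    return max(10, 2 * max_suggestions)

def or_query(field, tokens):
    """The plain multi-token OR query, e.g. label:(anna OR maria)."""
    return "{}:({})".format(field, " OR ".join(tokens))

def phrase_clause(field, tokens, slop=None, boost=None):
    """A phrase clause over all tokens. Without slop it only matches the
    tokens in exactly this order; with slop it becomes a proximity query."""
    clause = '{}:"{}"'.format(field, " ".join(tokens))
    if slop is not None:
        clause += "~{}".format(slop)    # term proximity (sloppy phrase)
    if boost is not None:
        clause += "^{}".format(boost)   # query-time boost for this clause
    return clause

def or_query_with_phrase(field, tokens, slop=None, boost=None):
    """OR query plus an optional clause that only matches (and therefore
    only boosts) entities containing all tokens."""
    return "{} OR {}".format(or_query(field, tokens),
                             phrase_clause(field, tokens, slop, boost))

tokens = ["anna", "maria"]
exact_phrase = or_query_with_phrase("label", tokens)
proximity = or_query_with_phrase("label", tokens, slop=5, boost=2)
```

With these hypothetical inputs, exact_phrase corresponds to the explicit
"{token1} {token2}" option and proximity to the term proximity option; how a
query-time boost as in [2] would be applied in practice is not specified here.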
[1] http://wiki.apache.org/solr/SolrRelevancyCookbook#Term_Proximity
[2] http://wiki.apache.org/solr/SolrRelevancyCookbook#Ranking_Terms
was:
Tests for EntityLinking against big vocabularies (e.g. Freebase with about 40
million entities) have shown that the Solr queries currently used for
multi-token OR queries do not always rank the results as expected, for the
following reasons:
ReferencedSites use entity rankings (implemented as index-time document
boosts). Those rankings influence the ranking of query results. On
the positive side, they ensure that a query for Paris returns
Paris, France before Paris, Texas. On the negative side, for a query with two
tokens (e.g. two given names), entities that match only one of those terms
(e.g. a very famous person with one of the two requested given names) may be
ranked before entities with a lower ranking that match both terms.
This is even more likely for terms that are very common in the index, as
normalization will reduce the boost for entities with such a term, so that
the document boost has an even higher impact.
The described behavior is especially a problem for the EntityLinkingEngine, as
it uses exactly this kind of "{term1} OR {term2}" query to look up entities.
The use of "Term Proximity" as suggested by [1] is clearly the best option to
work around the stated problem: (1) entities that match only one of the
query terms get no boost from this part of the query, and (2) even for
entities that match several/all terms the ranking improves, as the
distance between the terms within the text is considered when calculating the
ranking. However, this also means that queries for multiple OR-connected
terms will be more complex and need some additional time to process.
The impact of this additional complexity needs to be investigated further.
Possible other workarounds:
* disable the use of index-time document boosts: However, this would have a
negative impact on everyday searches (e.g. for Paris) and is therefore not an
option in most scenarios.
* increase the number of selected entities for the EntityLinkingEngine:
currently max(10,2*maxSuggestion) entities are retrieved. Increasing this value
would make the engine more resistant to unexpected rankings. However, (1) it
only works around the problem rather than solving it; (2) some tests have shown
that even with the value increased to 50 the expected result is not included
(using the freebase.com index as dataset).
* explicitly add a "{token1} {token2}" query term in the EntityLinkingEngine
to queries for "{term1}" OR "{term2}". However, this would only boost entities
where {token1} and {token2} appear in exactly that order. Entities containing
"{token2} {token1}" or "{token1} {other} {token2}" would not get any boost.
* use a high query-time boost for multi-term OR queries as suggested by [2].
This would increase the boost given to entities containing both {token1}
and {token2} and therefore reduce the influence of the index-time document
boost used to represent the entity ranking. The advantage is that this will not
have any performance implications (as it only influences the ranking
computation and does not make the query more complex).
So if the performance overhead allows the use of phrase queries, this should be
enabled for the Entityhub SolrYard. If it causes a considerable performance
overhead, this should become a new option that can be activated/deactivated.
[1] http://wiki.apache.org/solr/SolrRelevancyCookbook#Term_Proximity
[2] http://wiki.apache.org/solr/SolrRelevancyCookbook#Ranking_Terms
> Improve ranking for multi term OR queries over the SolrYard
> -----------------------------------------------------------
>
> Key: STANBOL-1104
> URL: https://issues.apache.org/jira/browse/STANBOL-1104
> Project: Stanbol
> Issue Type: Improvement
> Components: Entityhub
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler