[
https://issues.apache.org/jira/browse/SOLR-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018260#comment-18018260
]
David Smiley commented on SOLR-17319:
-------------------------------------
Way 1: You mention that Chris referenced this, but note that he did so
critically, not in favor (AFAICT). Quoting him:
{quote}Isn't any approach that computes a "score" based on the RRF _per shard_
and _then_ merges the per-shard results to find the topN results by definition
"wrong" according to the RRF formula? (or at least: the RRF formula as i
understand it?)
{quote}
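To make Chris's objection concrete, here is a minimal Python sketch on toy data. Everything in it is an assumption for illustration: the shard names, doc ids, and scores are made up, and k=60 is the constant used in the Cormack et al. paper. It compares RRF computed over whole-corpus ranked lists with RRF computed per shard and merged afterwards.

```python
from collections import defaultdict

K = 60  # constant from Cormack et al.; an assumption here, not a Solr default


def ranked(doc_scores):
    """Doc ids ordered by score descending, id ascending on ties."""
    return [d for d, _ in sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))]


def rrf(ranked_lists, k=K):
    """Return {doc_id: RRF score}; ranks start at 1."""
    scores = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += 1.0 / (k + rank)
    return dict(scores)


# Hypothetical per-shard scores for two sub-queries (q1, q2).
shards = {
    "shardA": {"q1": {"a1": 9.0, "a2": 8.0}, "q2": {"a1": 9.0, "a2": 8.0}},
    "shardB": {"q1": {"b1": 1.0, "b2": 0.5}, "q2": {"b1": 1.0, "b2": 0.5}},
}

# Correct: build one whole-corpus ranked list per sub-query, then apply RRF.
global_lists = []
for q in ("q1", "q2"):
    merged = {}
    for per_query in shards.values():
        merged.update(per_query[q])
    global_lists.append(ranked(merged))
global_top = ranked(rrf(global_lists))[:2]

# "Way 1": apply RRF inside each shard, then merge shards by RRF score.
per_shard_scores = {}
for per_query in shards.values():
    per_shard_scores.update(rrf([ranked(per_query["q1"]), ranked(per_query["q2"])]))
per_shard_top = ranked(per_shard_scores)[:2]

print(global_top, per_shard_top)
```

In this toy case b1 is ranked third in the whole corpus for both sub-queries, yet it is first within its own shard, so the per-shard variant promotes it above a2.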
I completely agree with Chris, and I affirmed equivalent statements in my last
summary. Only "Way 2" is correct. I'm not sure I understand/agree with what
you said about "Way 2" so I'd like to offer what I think it'd look like:
Way 2: A new QueryComponent subclass or collaborating SearchComponent shall
arrange to execute the sub-queries concurrently using distributed-search (thus
across shards), obtaining for each sub-query a complete (whole-corpus) ranked
list of its top offset+rows docs. It then shall merge and rank those lists
according to RRF, and finally apply offset & rows to derive the correct
DocSlice (page).
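The merge-then-page step can be sketched roughly as follows. This is Python for brevity, not Solr code; `rrf_page` is a hypothetical name, k=60 is assumed from the Cormack et al. paper, and the two input lists are what I'd expect the `lexical1` and `lexical2` queries from the issue description to return, each already ranked across the whole corpus.

```python
def rrf_page(ranked_lists, offset, rows, k=60):
    """Fuse whole-corpus ranked lists with RRF, then cut the requested page.

    Each input list is assumed to hold the top offset+rows doc ids of one
    sub-query, best first.
    """
    scores = {}
    for docs in ranked_lists:
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; doc id breaks ties deterministically.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in fused[offset:offset + rows]]


lexical1 = ["10", "2", "4"]  # id:(10^=2 OR 2^=1 OR 4^=0.5)
lexical2 = ["2", "4", "3"]   # id:(2^=2 OR 4^=1 OR 3^=0.5)

first_page = rrf_page([lexical1, lexical2], offset=0, rows=2)
second_page = rrf_page([lexical1, lexical2], offset=2, rows=2)
print(first_page, second_page)
```

Docs 2 and 4 appear in both lists and so outrank doc 10, even though doc 10 is first in `lexical1`. Paging is applied only after fusion, which is why each sub-query must supply offset+rows docs.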
Details:
* I'm not sure if the SearchComponent distributed-search protocol/API can
process sub-queries in parallel somehow. It can do shards in parallel but I'm
not sure about N sub-queries in parallel. It's a complicated, under-documented
protocol as well. But certainly a component could use a ShardHandler's
executor to independently do the requests, maybe using EmbeddedSolrServer to
talk to the current core.
* I recommend forcing distributed-search / shortCircuit=false somehow so that
you basically have one distributed implementation to code/maintain/test instead
of two, i.e. no separate single-shard optimized code path.
* Faceting or other queries requiring a DocSet could be initially unsupported
and added later. A trick would be to participate in the distributed-search
protocol but exclude the DocSlice (e.g. by setting rows=0) since that portion
of the results is handled separately with the sub-queries mechanism just
described; we don't need/want QueryComponent to get the top docs on its own.
The sharded query must do a disjunction of the sub-queries (logical OR) when it
needs the DocSet, for example by simply setting the main query to that
disjunction.
* I think this approach raises fewer concerns about interfacing with
QueryComponent's existing code or duplicating it.
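A rough sketch of two of the points above, in Python rather than Solr's Java and with no real Solr API involved: running the N sub-queries concurrently on an executor, and building the main request as a disjunction with rows=0 for when a DocSet is needed. `run_sub_queries`, `docset_params`, and the `search` callable are hypothetical names; `search` stands in for whatever issues one distributed request.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sub_queries(sub_queries, search, depth, max_workers=4):
    """Run every sub-query concurrently; return {name: ranked doc ids}.

    `search(query, rows)` is a hypothetical stand-in for one distributed
    request returning the top `rows` doc ids across all shards.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(search, q, depth) for name, q in sub_queries.items()}
        return {name: f.result() for name, f in futures.items()}


def docset_params(sub_queries):
    """Main-request params: OR of all sub-queries, no DocSlice (rows=0)."""
    disjunction = " OR ".join(f"({q})" for q in sub_queries.values())
    return {"q": disjunction, "rows": 0, "shortCircuit": "false"}


# Toy stand-in for the index, keyed by the queries from the issue description.
sub_queries = {
    "lexical1": "id:(10^=2 OR 2^=1 OR 4^=0.5)",
    "lexical2": "id:(2^=2 OR 4^=1 OR 3^=0.5)",
}
fake_index = {
    sub_queries["lexical1"]: ["10", "2", "4"],
    sub_queries["lexical2"]: ["2", "4", "3"],
}


def fake_search(query, rows):
    return fake_index[query][:rows]


results = run_sub_queries(sub_queries, fake_search, depth=10)
params = docset_params(sub_queries)
print(results, params)
```

The ranked lists in `results` would feed the RRF merge, while `params` describes the request that lets faceting and similar features see the combined DocSet without QueryComponent computing top docs itself.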
> Introduce support for Reciprocal Rank Fusion (combining queries)
> ----------------------------------------------------------------
>
> Key: SOLR-17319
> URL: https://issues.apache.org/jira/browse/SOLR-17319
> Project: Solr
> Issue Type: New Feature
> Components: vector-search
> Affects Versions: 9.6.1
> Reporter: Alessandro Benedetti
> Assignee: Alessandro Benedetti
> Priority: Major
> Labels: pull-request-available
> Time Spent: 23h 10m
> Remaining Estimate: 0h
>
> Reciprocal Rank Fusion (RRF) is an algorithm that takes multiple ranked
> lists as input to produce a unified result set.
> Examples of use cases where RRF can be used include hybrid search and
> multiple Knn vector queries executed concurrently.
> RRF is based on the concept of reciprocal rank, which is the inverse of the
> rank of a document in a ranked list of search results.
> The search results are combined by taking into account the position of
> the items in the original rankings, giving a higher score to items that
> are ranked higher in multiple lists. RRF was first introduced by
> Cormack et al. in [1].
> The syntax proposed:
> JSON Request
> {code:json}
> {
>   "queries": {
>     "lexical1": {
>       "lucene": {
>         "query": "id:(10^=2 OR 2^=1 OR 4^=0.5)"
>       }
>     },
>     "lexical2": {
>       "lucene": {
>         "query": "id:(2^=2 OR 4^=1 OR 3^=0.5)"
>       }
>     }
>   },
>   "limit": 10,
>   "fields": "[id,score]",
>   "params": {
>     "combiner": true,
>     "combiner.upTo": 5,
>     "facet": true,
>     "facet.field": "id",
>     "facet.mincount": 1
>   }
> }
> {code}
> [1] Cormack, Gordon V. et al. “Reciprocal rank fusion outperforms condorcet
> and individual rank learning methods.” Proceedings of the 32nd international
> ACM SIGIR conference on Research and development in information retrieval
> (2009)