[ 
https://issues.apache.org/jira/browse/SOLR-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855756#comment-17855756
 ] 

Chris M. Hostetter commented on SOLR-17319:
-------------------------------------------

{quote}Hi [~hossman], I suggest we move all discussions on the Pull Request 
where me and [~dsmiley] are currently iterating?

Thanks for your input by the way, let's see how the PR evolves and take it from 
there!
{quote}
That implies all of the discussion is specific to that PR – which is one of the 
main aspects of the "Let's do everything in PRs" dev approach that I hate.


I have not reviewed the PR, and i am not commenting on the PR specifically – i 
am commenting on the _IDEA_ tracked by this issue, which the PR is attempting 
to implement – but even if you abandon your PR, or some alternative/competing 
PR(s) come along and also attempt to implement the idea tracked here, my 
comments/concerns would still be applicable the _ideas_ being discussed.
----
{quote}Combining query results (such as Reciprocal Rank Fusion) can happen per 
node(1) or in the coordinator (2) (after results are already merged from 
different shards).
(1) is not ideal in my opinion, but I don't think it's strictly wrong: ...
...
{quote}
Maybe I'm misunderstanding something, but my naive understanding of the RRF 
formula (based on some limited googling/reading) is that in order to compute 
the combined rank of any document Dx against queries Qa, Qb, Qc, etc... you 
need to know the rank order of Dx (in the entire corpus) against each of those 
individual queries. how can you possibly know that on a per-shard basis?

Isn't any approach that computes a "score" based on the RRF _per shard_ and 
_then_ merges the per-shard results to find the topN results by defiition 
"wrong" according to the RRF formula? (or at least: the RRF formula as i 
understand it?)

Let's assume we have 8 docs, and those 8 docs have the following "rankings" for 
two queries Qx and Qy (IIUC, the actual "scores" each doc has against Qx and Qy 
aren't relevant to RRF, just their "rank")....
{noformat}
      Rank(D,Q)     score(RRF(k=0))
Doc   Qx   Qy
D1    1    2           1.5
D2    2    3           0.833
D3    3    4           0.583
D4    4    1           1.25
D5    5    8           0.325
D6    6    7           0.309
D7    7    6           0.309
D8    8    5           0.325
{noformat}
That means, IIUC the "correct" RRF sorting for all docs should be...
{noformat}
D1
D4
D2
D3
D5
 D8 (tie)
D6
 D7 (tie)
{noformat}
...do i have that right?

So now let's assume D1-D4 on are shardA, while D5-D8 are on shardB – that means 
the per-shard Rank(D, Q) for each doc will be different, and the RRF "score" 
computed per shard will be different...
{noformat}
shardA
      Rank(D,Q)     score(RRF(k=0))
Doc   Qx   Qy
D1    1    2           1.5
D2    2    3           0.833
D3    3    4           0.583
D4    4    1           1.25

shardB
      Rank(D,Q)     score(RRF(k=0))
Doc   Qx   Qy
D5    1    4           1.25
D6    2    3           0.833
D7    3    2           0.833
D8    4    1           1.25
{noformat}
...so we'll get a very different/wrong "merged" RRF sorted order based on those 
per-shard scores...
{noformat}
D1
D4
 D5 (tie)
 D8 (tie)
D2
 D6 (tie)
 D7 (tie)
D3
{noformat}
...right?

Obviously this assumes a very extreme case of bias in how the docs are 
distributed across the shards – particularly concerning their relative rank 
against Qx – but even much less extreme distributions of data, the results can 
be widely off.

Consider a situation where the "odd" docIds are on shardO and the "even" docIds 
are on shardE
{noformat}
shardO
      Rank(D,Q)     score(RRF(k=0))
Doc   Qx   Qy
D1    1    1           2
D3    2    2           1
D5    3    4           0.583
D7    4    3           0.583

shardE
      Rank(D,Q)     score(RRF(k=0))
Doc   Qx   Qy
D2    1    2           1.5
D4    2    1           1.5
D6    3    4           0.583
D8    4    3           0.583
{noformat}
... the merged RRF sorted order based on those per-shard scores will still be 
very different then the "expected" value if there was a single shard...
{noformat}
D1
D2 
 D4 (tie)
D3
D5
 D6 (tie)
 D7 (tie)
 D8 (tie)
{noformat}
The bottom line, as i see it, is that since the RRF formula is based on the 
*RANK* of each documents against each query, the only way to accurately compute 
the RRF value for a document requires returning either the scores, or rank 
order, of each document _against every query_ to the coordinator node in order 
to compute the _overall_ rank of that document across all docs from all shards.

> Introduce support for Reciprocal Rank Fusion (combining queries)
> ----------------------------------------------------------------
>
>                 Key: SOLR-17319
>                 URL: https://issues.apache.org/jira/browse/SOLR-17319
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: query
>    Affects Versions: 9.6.1
>            Reporter: Alessandro Benedetti
>            Assignee: Alessandro Benedetti
>            Priority: Major
>
> Reciprocal Rank Fusion (RRF) is an algorithm that takes in input multiple 
> ranked lists to produce a unified result set. 
> Examples of use cases where RRF can be used include hybrid search and 
> multiple Knn vector queries executed concurrently. 
> RRF is based on the concept of reciprocal rank, which is the inverse of the 
> rank of a document in a ranked list of search results. 
> The combination of search results happens taking into account the position of
>  the items in the original rankings, and giving higher score to items that 
> are ranked higher in multiple lists. RRF was introduced the first time by 
> Cormack et al. in [1].
> The syntax proposed:
> JSON Request
> {code:json}
> {
>     "queries": {
>         "lexical1": {
>             "lucene": {
>                 "query": "id:(10^=2 OR 2^=1 OR 4^=0.5)"
>             }
>         },
>         "lexical2": {
>             "lucene": {
>                 "query": "id:(2^=2 OR 4^=1 OR 3^=0.5)"
>             }
>         }
>     },
>     "limit": 10,
>     "fields": "[id,score]",
>     "params": {
>         "combiner": true,
>         "combiner.upTo": 5,
>         "facet": true,
>         "facet.field": "id",
>         "facet.mincount": 1
>     }
> }
> {code}
> [1] Cormack, Gordon V. et al. “Reciprocal rank fusion outperforms condorcet 
> and individual rank learning methods.” Proceedings of the 32nd international 
> ACM SIGIR conference on Research and development in information retrieval 
> (2009)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to