Hey Michael,

I wrote a first version of my idea for implementing RRF in Lucene; here is the link to the code: https://gist.github.com/hack4chang/ee2b37eab80bd82e574ff4f94ed204e9. Right now I have two open questions: one is about which shardIndex should be returned for the fused hits, and the other is about what TotalHits value the merged TopDocs should report. Please take a look at the code and kindly leave some comments below.
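To make the two questions concrete, here is a rough sketch of the shape I have in mind, following Adrien's suggested signature `TopDocs#rrf(int topN, int k, TopDocs[] hits)` and keying hits on (shardIndex, doc). The class name, the tie-breaking, and the choice of TotalHits relation are just placeholders for discussion, not something I'm settled on:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;

public final class RRFUtil {

  /**
   * Reciprocal Rank Fusion over several ranked lists: each hit contributes
   * 1 / (k + rank) to its fused score, and hits sharing (shardIndex, doc)
   * are treated as the same document.
   */
  public static TopDocs rrf(int topN, int k, TopDocs[] hits) {
    // key = "shardIndex:doc", value = fused ScoreDoc whose score accumulates 1/(k + rank)
    Map<String, ScoreDoc> fused = new HashMap<>();
    for (TopDocs topDocs : hits) {
      int rank = 0;
      for (ScoreDoc sd : topDocs.scoreDocs) {
        rank++; // ranks are 1-based
        String key = sd.shardIndex + ":" + sd.doc;
        float contribution = 1f / (k + rank);
        ScoreDoc existing = fused.get(key);
        if (existing == null) {
          // NOTE: keeps the shardIndex of the first list the doc appears in;
          // this is exactly the "which shardIndex to return" question.
          fused.put(key, new ScoreDoc(sd.doc, contribution, sd.shardIndex));
        } else {
          existing.score += contribution;
        }
      }
    }
    ScoreDoc[] merged = fused.values().toArray(new ScoreDoc[0]);
    // Sort by fused score descending, break ties on (shardIndex, doc) for stability.
    Arrays.sort(merged, (a, b) -> {
      int cmp = Float.compare(b.score, a.score);
      if (cmp != 0) return cmp;
      cmp = Integer.compare(a.shardIndex, b.shardIndex);
      if (cmp != 0) return cmp;
      return Integer.compare(a.doc, b.doc);
    });
    if (merged.length > topN) {
      merged = Arrays.copyOf(merged, topN);
    }
    // NOTE: reports the number of distinct documents seen, with
    // GREATER_THAN_OR_EQUAL_TO since the input lists may be truncated;
    // this is the open TotalHits question.
    TotalHits totalHits =
        new TotalHits(fused.size(), TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO);
    return new TopDocs(totalHits, merged);
  }
}

In particular I'm unsure whether keeping the shardIndex of the first list a document appears in is the right call, and whether counting distinct documents with GREATER_THAN_OR_EQUAL_TO is a reasonable TotalHits for the fused result.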
Thanks, Hank > On May 18, 2024, at 2:01 PM, Chang Hank <[email protected]> wrote: > > Or maybe we can first create an issue and PR based on the issue number? > WDYT? > > Best, > > Hank > >> On May 18, 2024, at 11:29 AM, Chang Hank <[email protected]> wrote: >> >> Hey Michael, >> >> Sorry I was a bit busy this week, but I’ve looked into the resources you >> provided and also some useful advice from Alessandro and Adrien. >> >> I have a briefly understanding of how RRF works, but I’m not quite sure how >> we should implement it. Based on the advice from Alessandro and Adrien, it >> seems we need to consider that the search results are located at different >> shards. According to Alessandro, we should aggregate the ranked lists from >> all distributed nodes and then apply RRF. >> Are we going to implement this aggregation logic inside our RRF method? >> >> Also could you please create a PR so we can discuss more details further? >> >> All the best, >> >> Hank >> >>> On May 13, 2024, at 10:09 AM, Michael Wechner <[email protected]> >>> wrote: >>> >>> Great, sounds like we have plan :-) >>> >>> Hank and I can get started trying to understand the internals better ... >>> >>> Thanks >>> >>> Michael >>> >>> Am 13.05.24 um 18:21 schrieb Alessandro Benedetti: >>>> Sure, we can make it work but in a distributed environment you have to run >>>> first each query distributed (aggregating all nodes) and then RRF on top >>>> of the aggregated ranked lists. >>>> Doing RRF per node first and then aggregate per shard won't return the >>>> same results I suspect. >>>> When I go back to working on the task I'll be able to elaborate more! >>>> >>>> Cheers >>>> -------------------------- >>>> Alessandro Benedetti >>>> Director @ Sease Ltd. >>>> Apache Lucene/Solr Committer >>>> Apache Solr PMC Member >>>> >>>> e-mail: [email protected] <mailto:[email protected]> >>>> >>>> >>>> Sease - Information Retrieval Applied >>>> Consulting | Training | Open Source >>>> >>>> Website: Sease.io <http://sease.io/> >>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter >>>> <https://twitter.com/seaseltd> | Youtube >>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github >>>> <https://github.com/seaseltd> >>>> >>>> On Mon, 13 May 2024 at 14:12, Adrien Grand <[email protected] >>>> <mailto:[email protected]>> wrote: >>>>> > Maybe Adrien Grand and others might also have some feedback :-) >>>>> >>>>> I'd suggest the signature to look something like `TopDocs TopDocs#rrf(int >>>>> topN, int k, TopDocs[] hits)` to be consistent with `TopDocs#merge`. >>>>> Internally, it should look at `ScoreDoc#shardId` and `ScoreDoc#doc` to >>>>> figure out which hits map to the same document. >>>>> >>>>> > Back in the day, I was reasoning on this and I didn't think Lucene was >>>>> > the right place for an interleaving algorithm, given that Reciprocal >>>>> > Rank Fusion is affected by distribution and it's not supposed to work >>>>> > per node. >>>>> >>>>> To me this is like `TopDocs#merge`. There are changes needed on the >>>>> application side to hook this call into the logic that combines hits that >>>>> come from multiple shards (multiple queries in the case of RRF), but >>>>> Lucene can still provide the merging logic. >>>>> >>>>> On Mon, May 13, 2024 at 1:41 PM Michael Wechner >>>>> <[email protected] <mailto:[email protected]>> wrote: >>>>>> Thanks for your feedback Alessandro! 
>>>>>> >>>>>> I am using Lucene independent of Solr or OpenSearch, Elasticsearch, but >>>>>> would like to combine different result sets using RRF, therefore think >>>>>> that Lucene itself could be a good place actually. >>>>>> >>>>>> Looking forward to your additional elaboration! >>>>>> >>>>>> Thanks >>>>>> >>>>>> Michael >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Am 13.05.2024 um 12:34 schrieb Alessandro Benedetti >>>>>>> <[email protected] <mailto:[email protected]>>: >>>>>>> >>>>>>> This is not strictly related to Lucene, but I'll give a talk at Berlin >>>>>>> Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache >>>>>>> Solr. >>>>>>> I'll resume my work on the contribution next week and have more to >>>>>>> share later. >>>>>>> >>>>>>> Back in the day, I was reasoning on this and I didn't think Lucene was >>>>>>> the right place for an interleaving algorithm, given that Reciprocal >>>>>>> Rank Fusion is affected by distribution and it's not supposed to work >>>>>>> per node. >>>>>>> I think I evaluated the possibility of doing it as a Lucene query or a >>>>>>> Lucene component but then ended up with a different approach. >>>>>>> I'll elaborate more when I go back to the task! >>>>>>> >>>>>>> Cheers >>>>>>> -------------------------- >>>>>>> Alessandro Benedetti >>>>>>> Director @ Sease Ltd. >>>>>>> Apache Lucene/Solr Committer >>>>>>> Apache Solr PMC Member >>>>>>> >>>>>>> e-mail: [email protected] <mailto:[email protected]> >>>>>>> >>>>>>> >>>>>>> Sease - Information Retrieval Applied >>>>>>> Consulting | Training | Open Source >>>>>>> >>>>>>> Website: Sease.io <http://sease.io/> >>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter >>>>>>> <https://twitter.com/seaseltd> | Youtube >>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github >>>>>>> <https://github.com/seaseltd> >>>>>>> >>>>>>> On Sat, 11 May 2024 at 09:10, Michael Wechner >>>>>>> <[email protected] <mailto:[email protected]>> wrote: >>>>>>>> sure, no problem! >>>>>>>> >>>>>>>> Maybe Adrien Grand and others might also have some feedback :-) >>>>>>>> >>>>>>>> Thanks >>>>>>>> >>>>>>>> Michael >>>>>>>> >>>>>>>> Am 10.05.24 um 23:03 schrieb Chang Hank: >>>>>>>>> Thank you for these useful resources, please allow me to spend some >>>>>>>>> time look into it. >>>>>>>>> I’ll let you know asap!! >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> >>>>>>>>> Hank >>>>>>>>> >>>>>>>>>> On May 10, 2024, at 12:34 PM, Michael Wechner >>>>>>>>>> <[email protected]> <mailto:[email protected]> wrote: >>>>>>>>>> >>>>>>>>>> also we might want to consider how this relates to >>>>>>>>>> >>>>>>>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html >>>>>>>>>> >>>>>>>>>> In vector search reranking has become quite popular, e.g. >>>>>>>>>> >>>>>>>>>> https://docs.cohere.com/docs/reranking >>>>>>>>>> >>>>>>>>>> IIUC LangChain (python) for example adds the reranker as an argument >>>>>>>>>> to the searcher/retriever >>>>>>>>>> >>>>>>>>>> https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/ >>>>>>>>>> >>>>>>>>>> So maybe the following might make sense as well >>>>>>>>>> >>>>>>>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10); >>>>>>>>>> TopDocs topDocsVector = vectorSearcher.search(query, 50, new >>>>>>>>>> CohereReranker()); >>>>>>>>>> >>>>>>>>>> TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, >>>>>>>>>> topDocsVector); >>>>>>>>>> >>>>>>>>>> WDYT? 
>>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> >>>>>>>>>> Michael >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Am 10.05.24 um 21:08 schrieb Michael Wechner: >>>>>>>>>>> great, yes, let's get started :-) >>>>>>>>>>> >>>>>>>>>>> What about the following pseudo code, assuming that there might be >>>>>>>>>>> alternative ranking algorithms to RRF >>>>>>>>>>> >>>>>>>>>>> StoredFieldsKeyword storedFieldsKeyword = >>>>>>>>>>> indexReaderKeyword.storedFields(); >>>>>>>>>>> StoredFieldsVector storedFieldsVector = >>>>>>>>>>> indexReaderKeyword.storedFields(); >>>>>>>>>>> >>>>>>>>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10); >>>>>>>>>>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50); >>>>>>>>>>> >>>>>>>>>>> Ranker ranker = new RRFRanker(); >>>>>>>>>>> TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, >>>>>>>>>>> topDocsVector); >>>>>>>>>>> >>>>>>>>>>> for (ScoreDoc scoreDoc : topDocs.scoreDocs) { >>>>>>>>>>> Document docK = storedFieldsKeyword.document(scoreDoc.doc); >>>>>>>>>>> Document docV = storedFieldsVector.document(scoreDoc.doc); >>>>>>>>>>> .... >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> whereas also see >>>>>>>>>>> >>>>>>>>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html >>>>>>>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html >>>>>>>>>>> >>>>>>>>>>> WDYT? >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> >>>>>>>>>>> Michael >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Am 10.05.24 um 20:01 schrieb Chang Hank: >>>>>>>>>>>> Hi Michael, >>>>>>>>>>>> >>>>>>>>>>>> Sounds good to me. >>>>>>>>>>>> Let’s do it!! >>>>>>>>>>>> >>>>>>>>>>>> Cheers, >>>>>>>>>>>> Hank >>>>>>>>>>>> >>>>>>>>>>>>> On May 10, 2024, at 10:50 AM, Michael Wechner >>>>>>>>>>>>> <[email protected]> <mailto:[email protected]> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Hank >>>>>>>>>>>>> >>>>>>>>>>>>> Very cool! >>>>>>>>>>>>> >>>>>>>>>>>>> Adrien Grand suggested to implement it as a utility method on >>>>>>>>>>>>> the TopDocs class, and since Adrien worked for a decade on Lucene >>>>>>>>>>>>> https://www.elastic.co/de/blog/author/adrien-grand >>>>>>>>>>>>> I guess it makes sense to follow his advice :-) >>>>>>>>>>>>> >>>>>>>>>>>>> We could create a PR and work together on it, WDYT? >>>>>>>>>>>>> >>>>>>>>>>>>> All the best >>>>>>>>>>>>> >>>>>>>>>>>>> Michael >>>>>>>>>>>>> >>>>>>>>>>>>> Am 10.05.24 um 18:51 schrieb Chang Hank: >>>>>>>>>>>>>> Hi Michael, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thank you for the reply. >>>>>>>>>>>>>> This is really a cool issue to work on, I’m happy to work on >>>>>>>>>>>>>> this with you. I’ll try to do research on RRF first. >>>>>>>>>>>>>> Also, are we going to implement this on the TopDocs class? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best, >>>>>>>>>>>>>> Hank >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> On May 9, 2024, at 11:08 PM, Michael Wechner >>>>>>>>>>>>>>> <[email protected]> <mailto:[email protected]> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Hank >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks for offering your help! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I recently suggested to implement RRF (Reciprocal Rank Fusion) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> but still have not found the time to really work on this. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Maybe you would be interested to do this or that we work on it >>>>>>>>>>>>>>> together somehow? 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Michael >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Am 10.05.24 um 07:27 schrieb Chang Hank: >>>>>>>>>>>>>>>> Hi everyone, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I’m Hank Chang, currently studying Information Retrieval >>>>>>>>>>>>>>>> topics. I’m really interested in contributing to Apache Lucene >>>>>>>>>>>>>>>> and enhance my understanding to the field. >>>>>>>>>>>>>>>> I’ve reviewed several issues posted on the Github repository >>>>>>>>>>>>>>>> but haven’t found a straightforward starting point. Could >>>>>>>>>>>>>>>> someone please recommend suitable issues for a newcomer like >>>>>>>>>>>>>>>> me or suggest areas I could assist with? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thank you for your time and guidance. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>>> Hank Chang >>>>>>>>>>>>>>>> --------------------------------------------------------------------- >>>>>>>>>>>>>>>> To unsubscribe, e-mail: [email protected] >>>>>>>>>>>>>>>> <mailto:[email protected]> >>>>>>>>>>>>>>>> For additional commands, e-mail: [email protected] >>>>>>>>>>>>>>>> <mailto:[email protected]> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> --------------------------------------------------------------------- >>>>>>>>>>>>>>> To unsubscribe, e-mail: [email protected] >>>>>>>>>>>>>>> <mailto:[email protected]> >>>>>>>>>>>>>>> For additional commands, e-mail: [email protected] >>>>>>>>>>>>>>> <mailto:[email protected]> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Adrien >>> >> >
