Or maybe we can first create an issue and PR based on the issue number? WDYT?
Best,
Hank

> On May 18, 2024, at 11:29 AM, Chang Hank <hackchang0...@gmail.com> wrote:
>
> Hey Michael,
>
> Sorry I was a bit busy this week, but I’ve looked into the resources you
> provided and also some useful advice from Alessandro and Adrien.
>
> I have a brief understanding of how RRF works, but I’m not quite sure how
> we should implement it. Based on the advice from Alessandro and Adrien, it
> seems we need to consider that the search results are located on different
> shards. According to Alessandro, we should aggregate the ranked lists from
> all distributed nodes and then apply RRF.
> Are we going to implement this aggregation logic inside our RRF method?
>
> Also, could you please create a PR so we can discuss the details further?
>
> All the best,
>
> Hank
>
>> On May 13, 2024, at 10:09 AM, Michael Wechner <michael.wech...@wyona.com>
>> wrote:
>>
>> Great, sounds like we have a plan :-)
>>
>> Hank and I can get started trying to understand the internals better ...
>>
>> Thanks
>>
>> Michael
>>
>> On 13.05.24 at 18:21, Alessandro Benedetti wrote:
>>> Sure, we can make it work, but in a distributed environment you first have
>>> to run each query distributed (aggregating all nodes) and then apply RRF on
>>> top of the aggregated ranked lists.
>>> Doing RRF per node first and then aggregating per shard won't return the
>>> same results, I suspect.
>>> When I go back to working on the task I'll be able to elaborate more!
>>>
>>> Cheers
>>> --------------------------
>>> Alessandro Benedetti
>>> Director @ Sease Ltd.
>>> Apache Lucene/Solr Committer
>>> Apache Solr PMC Member
>>>
>>> e-mail: a.benede...@sease.io <mailto:a.benede...@sease.io>
>>>
>>> Sease - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>> On Mon, 13 May 2024 at 14:12, Adrien Grand <jpou...@gmail.com
>>> <mailto:jpou...@gmail.com>> wrote:
>>>> > Maybe Adrien Grand and others might also have some feedback :-)
>>>>
>>>> I'd suggest the signature to look something like `TopDocs TopDocs#rrf(int
>>>> topN, int k, TopDocs[] hits)` to be consistent with `TopDocs#merge`.
>>>> Internally, it should look at `ScoreDoc#shardIndex` and `ScoreDoc#doc` to
>>>> figure out which hits map to the same document.
>>>>
>>>> > Back in the day, I was reasoning on this and I didn't think Lucene was
>>>> > the right place for an interleaving algorithm, given that Reciprocal
>>>> > Rank Fusion is affected by distribution and it's not supposed to work
>>>> > per node.
>>>>
>>>> To me this is like `TopDocs#merge`. There are changes needed on the
>>>> application side to hook this call into the logic that combines hits that
>>>> come from multiple shards (multiple queries in the case of RRF), but
>>>> Lucene can still provide the merging logic.
>>>>
>>>> On Mon, May 13, 2024 at 1:41 PM Michael Wechner <michael.wech...@wyona.com
>>>> <mailto:michael.wech...@wyona.com>> wrote:
>>>>> Thanks for your feedback Alessandro!
>>>>>
>>>>> I am using Lucene independent of Solr or OpenSearch, Elasticsearch, but
>>>>> would like to combine different result sets using RRF, therefore think
>>>>> that Lucene itself could be a good place actually.
>>>>>
>>>>> Looking forward to your additional elaboration!
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>>> On 13.05.2024 at 12:34, Alessandro Benedetti
>>>>>> <a.benede...@sease.io <mailto:a.benede...@sease.io>> wrote:
>>>>>>
>>>>>> This is not strictly related to Lucene, but I'll give a talk at Berlin
>>>>>> Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache Solr.
>>>>>> I'll resume my work on the contribution next week and have more to share
>>>>>> later.
>>>>>>
>>>>>> Back in the day, I was reasoning on this and I didn't think Lucene was
>>>>>> the right place for an interleaving algorithm, given that Reciprocal
>>>>>> Rank Fusion is affected by distribution and it's not supposed to work
>>>>>> per node.
>>>>>> I think I evaluated the possibility of doing it as a Lucene query or a
>>>>>> Lucene component but then ended up with a different approach.
>>>>>> I'll elaborate more when I go back to the task!
>>>>>>
>>>>>> Cheers
>>>>>> --------------------------
>>>>>> Alessandro Benedetti
>>>>>>
>>>>>> On Sat, 11 May 2024 at 09:10, Michael Wechner <michael.wech...@wyona.com
>>>>>> <mailto:michael.wech...@wyona.com>> wrote:
>>>>>>> sure, no problem!
>>>>>>>
>>>>>>> Maybe Adrien Grand and others might also have some feedback :-)
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> On 10.05.24 at 23:03, Chang Hank wrote:
>>>>>>>> Thank you for these useful resources, please allow me to spend some
>>>>>>>> time looking into them.
>>>>>>>> I’ll let you know asap!!
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Hank
>>>>>>>>
>>>>>>>>> On May 10, 2024, at 12:34 PM, Michael Wechner
>>>>>>>>> <michael.wech...@wyona.com> <mailto:michael.wech...@wyona.com> wrote:
>>>>>>>>>
>>>>>>>>> also we might want to consider how this relates to
>>>>>>>>>
>>>>>>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html
>>>>>>>>>
>>>>>>>>> In vector search reranking has become quite popular, e.g.
>>>>>>>>>
>>>>>>>>> https://docs.cohere.com/docs/reranking
>>>>>>>>>
>>>>>>>>> IIUC LangChain (python) for example adds the reranker as an argument
>>>>>>>>> to the searcher/retriever
>>>>>>>>>
>>>>>>>>> https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/
>>>>>>>>>
>>>>>>>>> So maybe the following might make sense as well
>>>>>>>>>
>>>>>>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>>>>>>>>> TopDocs topDocsVector = vectorSearcher.search(query, 50, new CohereReranker());
>>>>>>>>>
>>>>>>>>> TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, topDocsVector);
>>>>>>>>>
>>>>>>>>> WDYT?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10.05.24 at 21:08, Michael Wechner wrote:
>>>>>>>>>> great, yes, let's get started :-)
>>>>>>>>>>
>>>>>>>>>> What about the following pseudo code, assuming that there might be
>>>>>>>>>> alternative ranking algorithms to RRF
>>>>>>>>>>
>>>>>>>>>> StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
>>>>>>>>>> StoredFields storedFieldsVector = indexReaderVector.storedFields();
>>>>>>>>>>
>>>>>>>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>>>>>>>>>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);
>>>>>>>>>>
>>>>>>>>>> Ranker ranker = new RRFRanker();
>>>>>>>>>> TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);
>>>>>>>>>>
>>>>>>>>>> for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
>>>>>>>>>>     Document docK = storedFieldsKeyword.document(scoreDoc.doc);
>>>>>>>>>>     Document docV = storedFieldsVector.document(scoreDoc.doc);
>>>>>>>>>>     ....
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> whereas also see
>>>>>>>>>>
>>>>>>>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
>>>>>>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html
>>>>>>>>>>
>>>>>>>>>> WDYT?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Michael
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10.05.24 at 20:01, Chang Hank wrote:
>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>
>>>>>>>>>>> Sounds good to me.
>>>>>>>>>>> Let’s do it!!
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Hank
>>>>>>>>>>>
>>>>>>>>>>>> On May 10, 2024, at 10:50 AM, Michael Wechner
>>>>>>>>>>>> <michael.wech...@wyona.com> <mailto:michael.wech...@wyona.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Hank
>>>>>>>>>>>>
>>>>>>>>>>>> Very cool!
>>>>>>>>>>>>
>>>>>>>>>>>> Adrien Grand suggested implementing it as a utility method on the
>>>>>>>>>>>> TopDocs class, and since Adrien worked for a decade on Lucene
>>>>>>>>>>>> https://www.elastic.co/de/blog/author/adrien-grand
>>>>>>>>>>>> I guess it makes sense to follow his advice :-)
>>>>>>>>>>>>
>>>>>>>>>>>> We could create a PR and work together on it, WDYT?
>>>>>>>>>>>>
>>>>>>>>>>>> All the best
>>>>>>>>>>>>
>>>>>>>>>>>> Michael
>>>>>>>>>>>>
>>>>>>>>>>>> On 10.05.24 at 18:51, Chang Hank wrote:
>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for the reply.
>>>>>>>>>>>>> This is really a cool issue to work on, I’m happy to work on
>>>>>>>>>>>>> this with you. I’ll try to do research on RRF first.
>>>>>>>>>>>>> Also, are we going to implement this on the TopDocs class?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Hank
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On May 9, 2024, at 11:08 PM, Michael Wechner
>>>>>>>>>>>>>> <michael.wech...@wyona.com> <mailto:michael.wech...@wyona.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Hank
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for offering your help!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I recently suggested implementing RRF (Reciprocal Rank Fusion)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> but still have not found the time to really work on this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe you would be interested in doing this, or we could work
>>>>>>>>>>>>>> on it together somehow?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Michael
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10.05.24 at 07:27, Chang Hank wrote:
>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’m Hank Chang, currently studying Information Retrieval
>>>>>>>>>>>>>>> topics. I’m really interested in contributing to Apache Lucene
>>>>>>>>>>>>>>> and enhancing my understanding of the field.
>>>>>>>>>>>>>>> I’ve reviewed several issues posted on the Github repository
>>>>>>>>>>>>>>> but haven’t found a straightforward starting point. Could
>>>>>>>>>>>>>>> someone please recommend suitable issues for a newcomer like me
>>>>>>>>>>>>>>> or suggest areas I could assist with?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for your time and guidance.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Hank Chang
>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>>>>>>>>>>>> <mailto:dev-unsubscr...@lucene.apache.org>
>>>>>>>>>>>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>>>>>>>>>>>> <mailto:dev-h...@lucene.apache.org>
>>>>
>>>> --
>>>> Adrien
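
For reference, here is a minimal sketch of the fusion step discussed in this thread. It is not the Lucene API and not anyone's actual implementation: `RrfSketch`, `Hit` and `FusedHit` are hypothetical names made up for illustration, and `Hit` only mimics the two `ScoreDoc` fields Adrien mentions (shard index and per-shard doc ID). It assumes each input list is one query's ranking that has already been aggregated across all shards, as Alessandro recommends, that a document appears at most once per list, and that each list contributes 1 / (k + rank) per document with rank starting at 1. The value k = 60 in `main` is just a commonly used constant.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RrfSketch {

    /** Hypothetical stand-in for a Lucene ScoreDoc: a doc ID is only unique within its shard. */
    public record Hit(int shardIndex, int doc) {}

    /** A document together with its fused RRF score. */
    public record FusedHit(Hit hit, double score) {}

    /**
     * Fuses several ranked lists (one per query, each already aggregated across shards).
     * Each list contributes 1 / (k + rank) per document, with rank starting at 1.
     */
    public static List<FusedHit> rrf(int topN, int k, List<List<Hit>> rankedLists) {
        // LinkedHashMap keeps first-seen order, so ties are broken deterministically.
        Map<Hit, Double> scores = new LinkedHashMap<>();
        for (List<Hit> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                // Same (shardIndex, doc) pair means same document, so contributions add up.
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<FusedHit> fused = new ArrayList<>();
        scores.forEach((hit, score) -> fused.add(new FusedHit(hit, score)));
        // Highest fused score first; List.sort is stable, so ties keep first-seen order.
        fused.sort((a, b) -> Double.compare(b.score(), a.score()));
        return List.copyOf(fused.subList(0, Math.min(topN, fused.size())));
    }

    public static void main(String[] args) {
        // Doc 1 on shard 0 and doc 1 on shard 1 are different documents.
        List<Hit> keyword = List.of(new Hit(0, 1), new Hit(0, 2), new Hit(1, 1));
        List<Hit> vector = List.of(new Hit(0, 2), new Hit(1, 7), new Hit(0, 1));
        for (FusedHit f : rrf(3, 60, List.of(keyword, vector))) {
            System.out.println(f.hit() + " score=" + f.score());
        }
    }
}
```

In this example the document (shard 0, doc 2) comes out first because it is ranked second and first in the two lists, ahead of (shard 0, doc 1), which is ranked first and third. Keeping the aggregation across shards outside the method, and only fusing inside it, mirrors how Adrien compares the idea to `TopDocs#merge`.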