Hey Michael,

I wrote a first version of my idea for implementing RRF in Lucene; here is the link to the code: https://gist.github.com/hack4chang/ee2b37eab80bd82e574ff4f94ed204e9. Right now I have two open questions: one is about which shardIndex should be returned for the fused hits, and the other is about what TotalHits value the merged TopDocs should report. Please take a look at the code and kindly leave some comments below.
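To make the two questions concrete, here is a rough sketch of the shape I have in mind, following Adrien's suggested signature `TopDocs#rrf(int topN, int k, TopDocs[] hits)` and keying hits on (shardIndex, doc). The class name, the tie-breaking, and the choice of TotalHits relation are just placeholders for discussion, not something I'm settled on:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;

public final class RRFUtil {

  /**
   * Reciprocal Rank Fusion over several ranked lists: each hit contributes
   * 1 / (k + rank) to its fused score, and hits sharing (shardIndex, doc)
   * are treated as the same document.
   */
  public static TopDocs rrf(int topN, int k, TopDocs[] hits) {
    // key = "shardIndex:doc", value = fused ScoreDoc whose score accumulates 1/(k + rank)
    Map<String, ScoreDoc> fused = new HashMap<>();
    for (TopDocs topDocs : hits) {
      int rank = 0;
      for (ScoreDoc sd : topDocs.scoreDocs) {
        rank++; // ranks are 1-based
        String key = sd.shardIndex + ":" + sd.doc;
        float contribution = 1f / (k + rank);
        ScoreDoc existing = fused.get(key);
        if (existing == null) {
          // NOTE: keeps the shardIndex of the first list the doc appears in;
          // this is exactly the "which shardIndex to return" question.
          fused.put(key, new ScoreDoc(sd.doc, contribution, sd.shardIndex));
        } else {
          existing.score += contribution;
        }
      }
    }
    ScoreDoc[] merged = fused.values().toArray(new ScoreDoc[0]);
    // Sort by fused score descending, break ties on (shardIndex, doc) for stability.
    Arrays.sort(merged, (a, b) -> {
      int cmp = Float.compare(b.score, a.score);
      if (cmp != 0) return cmp;
      cmp = Integer.compare(a.shardIndex, b.shardIndex);
      if (cmp != 0) return cmp;
      return Integer.compare(a.doc, b.doc);
    });
    if (merged.length > topN) {
      merged = Arrays.copyOf(merged, topN);
    }
    // NOTE: reports the number of distinct documents seen, with
    // GREATER_THAN_OR_EQUAL_TO since the input lists may be truncated;
    // this is the open TotalHits question.
    TotalHits totalHits =
        new TotalHits(fused.size(), TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO);
    return new TopDocs(totalHits, merged);
  }
}

In particular I'm unsure whether keeping the shardIndex of the first list a document appears in is the right call, and whether counting distinct documents with GREATER_THAN_OR_EQUAL_TO is a reasonable TotalHits for the fused result.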
Thanks, Hank > On May 18, 2024, at 2:01 PM, Chang Hank <[email protected]> wrote: > > Or maybe we can first create an issue and PR based on the issue number? > WDYT? > > Best, > > Hank > >> On May 18, 2024, at 11:29 AM, Chang Hank <[email protected]> wrote: >> >> Hey Michael, >> >> Sorry I was a bit busy this week, but I’ve looked into the resources you >> provided and also some useful advice from Alessandro and Adrien. >> >> I have a briefly understanding of how RRF works, but I’m not quite sure how >> we should implement it. Based on the advice from Alessandro and Adrien, it >> seems we need to consider that the search results are located at different >> shards. According to Alessandro, we should aggregate the ranked lists from >> all distributed nodes and then apply RRF. >> Are we going to implement this aggregation logic inside our RRF method? >> >> Also could you please create a PR so we can discuss more details further? >> >> All the best, >> >> Hank >> >>> On May 13, 2024, at 10:09 AM, Michael Wechner <[email protected]> >>> wrote: >>> >>> Great, sounds like we have plan :-) >>> >>> Hank and I can get started trying to understand the internals better ... >>> >>> Thanks >>> >>> Michael >>> >>> Am 13.05.24 um 18:21 schrieb Alessandro Benedetti: >>>> Sure, we can make it work but in a distributed environment you have to run >>>> first each query distributed (aggregating all nodes) and then RRF on top >>>> of the aggregated ranked lists. >>>> Doing RRF per node first and then aggregate per shard won't return the >>>> same results I suspect. >>>> When I go back to working on the task I'll be able to elaborate more! >>>> >>>> Cheers >>>> -------------------------- >>>> Alessandro Benedetti >>>> Director @ Sease Ltd. >>>> Apache Lucene/Solr Committer >>>> Apache Solr PMC Member >>>> >>>> e-mail: [email protected] <mailto:[email protected]> >>>> >>>> >>>> Sease - Information Retrieval Applied >>>> Consulting | Training | Open Source >>>> >>>> Website: Sease.io <http://sease.io/> >>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter >>>> <https://twitter.com/seaseltd> | Youtube >>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github >>>> <https://github.com/seaseltd> >>>> >>>> On Mon, 13 May 2024 at 14:12, Adrien Grand <[email protected] >>>> <mailto:[email protected]>> wrote: >>>>> > Maybe Adrien Grand and others might also have some feedback :-) >>>>> >>>>> I'd suggest the signature to look something like `TopDocs TopDocs#rrf(int >>>>> topN, int k, TopDocs[] hits)` to be consistent with `TopDocs#merge`. >>>>> Internally, it should look at `ScoreDoc#shardId` and `ScoreDoc#doc` to >>>>> figure out which hits map to the same document. >>>>> >>>>> > Back in the day, I was reasoning on this and I didn't think Lucene was >>>>> > the right place for an interleaving algorithm, given that Reciprocal >>>>> > Rank Fusion is affected by distribution and it's not supposed to work >>>>> > per node. >>>>> >>>>> To me this is like `TopDocs#merge`. There are changes needed on the >>>>> application side to hook this call into the logic that combines hits that >>>>> come from multiple shards (multiple queries in the case of RRF), but >>>>> Lucene can still provide the merging logic. >>>>> >>>>> On Mon, May 13, 2024 at 1:41 PM Michael Wechner >>>>> <[email protected] <mailto:[email protected]>> wrote: >>>>>> Thanks for your feedback Alessandro! 
>>>>>> >>>>>> I am using Lucene independent of Solr or OpenSearch, Elasticsearch, but >>>>>> would like to combine different result sets using RRF, therefore think >>>>>> that Lucene itself could be a good place actually. >>>>>> >>>>>> Looking forward to your additional elaboration! >>>>>> >>>>>> Thanks >>>>>> >>>>>> Michael >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Am 13.05.2024 um 12:34 schrieb Alessandro Benedetti >>>>>>> <[email protected] <mailto:[email protected]>>: >>>>>>> >>>>>>> This is not strictly related to Lucene, but I'll give a talk at Berlin >>>>>>> Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache >>>>>>> Solr. >>>>>>> I'll resume my work on the contribution next week and have more to >>>>>>> share later. >>>>>>> >>>>>>> Back in the day, I was reasoning on this and I didn't think Lucene was >>>>>>> the right place for an interleaving algorithm, given that Reciprocal >>>>>>> Rank Fusion is affected by distribution and it's not supposed to work >>>>>>> per node. >>>>>>> I think I evaluated the possibility of doing it as a Lucene query or a >>>>>>> Lucene component but then ended up with a different approach. >>>>>>> I'll elaborate more when I go back to the task! >>>>>>> >>>>>>> Cheers >>>>>>> -------------------------- >>>>>>> Alessandro Benedetti >>>>>>> Director @ Sease Ltd. >>>>>>> Apache Lucene/Solr Committer >>>>>>> Apache Solr PMC Member >>>>>>> >>>>>>> e-mail: [email protected] <mailto:[email protected]> >>>>>>> >>>>>>> >>>>>>> Sease - Information Retrieval Applied >>>>>>> Consulting | Training | Open Source >>>>>>> >>>>>>> Website: Sease.io <http://sease.io/> >>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter >>>>>>> <https://twitter.com/seaseltd> | Youtube >>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github >>>>>>> <https://github.com/seaseltd> >>>>>>> >>>>>>> On Sat, 11 May 2024 at 09:10, Michael Wechner >>>>>>> <[email protected] <mailto:[email protected]>> wrote: >>>>>>>> sure, no problem! >>>>>>>> >>>>>>>> Maybe Adrien Grand and others might also have some feedback :-) >>>>>>>> >>>>>>>> Thanks >>>>>>>> >>>>>>>> Michael >>>>>>>> >>>>>>>> Am 10.05.24 um 23:03 schrieb Chang Hank: >>>>>>>>> Thank you for these useful resources, please allow me to spend some >>>>>>>>> time look into it. >>>>>>>>> I’ll let you know asap!! >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> >>>>>>>>> Hank >>>>>>>>> >>>>>>>>>> On May 10, 2024, at 12:34 PM, Michael Wechner >>>>>>>>>> <[email protected]> <mailto:[email protected]> wrote: >>>>>>>>>> >>>>>>>>>> also we might want to consider how this relates to >>>>>>>>>> >>>>>>>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html >>>>>>>>>> >>>>>>>>>> In vector search reranking has become quite popular, e.g. >>>>>>>>>> >>>>>>>>>> https://docs.cohere.com/docs/reranking >>>>>>>>>> >>>>>>>>>> IIUC LangChain (python) for example adds the reranker as an argument >>>>>>>>>> to the searcher/retriever >>>>>>>>>> >>>>>>>>>> https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/ >>>>>>>>>> >>>>>>>>>> So maybe the following might make sense as well >>>>>>>>>> >>>>>>>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10); >>>>>>>>>> TopDocs topDocsVector = vectorSearcher.search(query, 50, new >>>>>>>>>> CohereReranker()); >>>>>>>>>> >>>>>>>>>> TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, >>>>>>>>>> topDocsVector); >>>>>>>>>> >>>>>>>>>> WDYT? 
>>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> >>>>>>>>>> Michael >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Am 10.05.24 um 21:08 schrieb Michael Wechner: >>>>>>>>>>> great, yes, let's get started :-) >>>>>>>>>>> >>>>>>>>>>> What about the following pseudo code, assuming that there might be >>>>>>>>>>> alternative ranking algorithms to RRF >>>>>>>>>>> >>>>>>>>>>> StoredFieldsKeyword storedFieldsKeyword = >>>>>>>>>>> indexReaderKeyword.storedFields(); >>>>>>>>>>> StoredFieldsVector storedFieldsVector = >>>>>>>>>>> indexReaderKeyword.storedFields(); >>>>>>>>>>> >>>>>>>>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10); >>>>>>>>>>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50); >>>>>>>>>>> >>>>>>>>>>> Ranker ranker = new RRFRanker(); >>>>>>>>>>> TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, >>>>>>>>>>> topDocsVector); >>>>>>>>>>> >>>>>>>>>>> for (ScoreDoc scoreDoc : topDocs.scoreDocs) { >>>>>>>>>>> Document docK = storedFieldsKeyword.document(scoreDoc.doc); >>>>>>>>>>> Document docV = storedFieldsVector.document(scoreDoc.doc); >>>>>>>>>>> .... >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> whereas also see >>>>>>>>>>> >>>>>>>>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html >>>>>>>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html >>>>>>>>>>> >>>>>>>>>>> WDYT? >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> >>>>>>>>>>> Michael >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Am 10.05.24 um 20:01 schrieb Chang Hank: >>>>>>>>>>>> Hi Michael, >>>>>>>>>>>> >>>>>>>>>>>> Sounds good to me. >>>>>>>>>>>> Let’s do it!! >>>>>>>>>>>> >>>>>>>>>>>> Cheers, >>>>>>>>>>>> Hank >>>>>>>>>>>> >>>>>>>>>>>>> On May 10, 2024, at 10:50 AM, Michael Wechner >>>>>>>>>>>>> <[email protected]> <mailto:[email protected]> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Hank >>>>>>>>>>>>> >>>>>>>>>>>>> Very cool! >>>>>>>>>>>>> >>>>>>>>>>>>> Adrien Grand suggested to implement it as a utility method on >>>>>>>>>>>>> the TopDocs class, and since Adrien worked for a decade on Lucene >>>>>>>>>>>>> https://www.elastic.co/de/blog/author/adrien-grand >>>>>>>>>>>>> I guess it makes sense to follow his advice :-) >>>>>>>>>>>>> >>>>>>>>>>>>> We could create a PR and work together on it, WDYT? >>>>>>>>>>>>> >>>>>>>>>>>>> All the best >>>>>>>>>>>>> >>>>>>>>>>>>> Michael >>>>>>>>>>>>> >>>>>>>>>>>>> Am 10.05.24 um 18:51 schrieb Chang Hank: >>>>>>>>>>>>>> Hi Michael, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thank you for the reply. >>>>>>>>>>>>>> This is really a cool issue to work on, I’m happy to work on >>>>>>>>>>>>>> this with you. I’ll try to do research on RRF first. >>>>>>>>>>>>>> Also, are we going to implement this on the TopDocs class? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best, >>>>>>>>>>>>>> Hank >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> On May 9, 2024, at 11:08 PM, Michael Wechner >>>>>>>>>>>>>>> <[email protected]> <mailto:[email protected]> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Hank >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks for offering your help! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I recently suggested to implement RRF (Reciprocal Rank Fusion) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> but still have not found the time to really work on this. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Maybe you would be interested to do this or that we work on it >>>>>>>>>>>>>>> together somehow? 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Michael >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Am 10.05.24 um 07:27 schrieb Chang Hank: >>>>>>>>>>>>>>>> Hi everyone, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I’m Hank Chang, currently studying Information Retrieval >>>>>>>>>>>>>>>> topics. I’m really interested in contributing to Apache Lucene >>>>>>>>>>>>>>>> and enhance my understanding to the field. >>>>>>>>>>>>>>>> I’ve reviewed several issues posted on the Github repository >>>>>>>>>>>>>>>> but haven’t found a straightforward starting point. Could >>>>>>>>>>>>>>>> someone please recommend suitable issues for a newcomer like >>>>>>>>>>>>>>>> me or suggest areas I could assist with? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thank you for your time and guidance. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>>> Hank Chang >>>>>>>>>>>>>>>> --------------------------------------------------------------------- >>>>>>>>>>>>>>>> To unsubscribe, e-mail: [email protected] >>>>>>>>>>>>>>>> <mailto:[email protected]> >>>>>>>>>>>>>>>> For additional commands, e-mail: [email protected] >>>>>>>>>>>>>>>> <mailto:[email protected]> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> --------------------------------------------------------------------- >>>>>>>>>>>>>>> To unsubscribe, e-mail: [email protected] >>>>>>>>>>>>>>> <mailto:[email protected]> >>>>>>>>>>>>>>> For additional commands, e-mail: [email protected] >>>>>>>>>>>>>>> <mailto:[email protected]> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Adrien >>> >> >
