Re: Any recommended issues to work on for a newcomer?

Chang Hank Sat, 18 May 2024 14:02:12 -0700

Or maybe we can first create an issue and then open a PR that references the issue number?
WDYT?


Best,

Hank

> On May 18, 2024, at 11:29 AM, Chang Hank <hackchang0...@gmail.com> wrote:
> 
> Hey Michael, 
> 
> Sorry, I was a bit busy this week, but I’ve looked into the resources you 
> provided and also the useful advice from Alessandro and Adrien.
> 
> I have a brief understanding of how RRF works, but I’m not quite sure how 
> we should implement it. Based on the advice from Alessandro and Adrien, it 
> seems we need to consider that the search results are located on different 
> shards. According to Alessandro, we should aggregate the ranked lists from 
> all distributed nodes and then apply RRF.
> Are we going to implement this aggregation logic inside our RRF method? 
> 
> Also, could you please create a PR so we can discuss the details further?
> 
> All the best,
> 
> Hank
> 
>> On May 13, 2024, at 10:09 AM, Michael Wechner <michael.wech...@wyona.com> 
>> wrote:
>> 
>> Great, sounds like we have a plan :-)
>> 
>> Hank and I can get started trying to understand the internals better ...
>> 
>> Thanks
>> 
>> Michael
>> 
>> On 13.05.24 at 18:21, Alessandro Benedetti wrote:
>>> Sure, we can make it work, but in a distributed environment you first have 
>>> to run each query distributed (aggregating the results from all nodes) and 
>>> then apply RRF on top of the aggregated ranked lists.
>>> Doing RRF per node first and then aggregating the per-shard results won't 
>>> return the same results, I suspect.
>>> When I go back to working on the task I'll be able to elaborate more!
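
A small self-contained sketch of why the order matters, under stated assumptions: document keys are plain strings, `K = 60` is an assumed RRF constant, and "RRF per node first" is read as fusing each shard's local lists and then merging the shards by fused score. All class and method names are illustrative and not part of Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RrfShardOrderDemo {
    static final int K = 60; // assumed RRF constant, not taken from the thread

    // Rank document keys by descending score; ties are broken by key for determinism.
    static List<String> rank(Map<String, Double> scores) {
        List<String> docs = new ArrayList<>(scores.keySet());
        docs.sort((a, b) -> {
            int c = Double.compare(scores.get(b), scores.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return docs;
    }

    // RRF: every list contributes 1 / (K + rank) to a document, with 1-based ranks.
    static Map<String, Double> rrf(List<List<String>> rankedLists) {
        Map<String, Double> fused = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                fused.merge(list.get(i), 1.0 / (K + i + 1), Double::sum);
            }
        }
        return fused;
    }

    // shards.get(s).get(q) maps document key -> score of query q on shard s.
    // Aggregate each query across all shards first, then apply RRF once.
    static List<String> aggregateThenFuse(List<List<Map<String, Double>>> shards, int numQueries) {
        List<List<String>> globalLists = new ArrayList<>();
        for (int q = 0; q < numQueries; q++) {
            Map<String, Double> all = new HashMap<>();
            for (List<Map<String, Double>> shard : shards) {
                all.putAll(shard.get(q));
            }
            globalLists.add(rank(all));
        }
        return rank(rrf(globalLists));
    }

    // Apply RRF on every shard separately, then merge shards by fused score.
    static List<String> fuseThenAggregate(List<List<Map<String, Double>>> shards) {
        Map<String, Double> all = new HashMap<>();
        for (List<Map<String, Double>> shard : shards) {
            List<List<String>> localLists = new ArrayList<>();
            for (Map<String, Double> queryScores : shard) {
                localLists.add(rank(queryScores));
            }
            all.putAll(rrf(localLists));
        }
        return rank(all);
    }

    // Shard 0 holds A, B, C; shard 1 holds only D, the weakest hit for both queries.
    static List<List<Map<String, Double>>> sampleShards() {
        Map<String, Double> keyword0 = Map.of("A", 4.0, "B", 3.0, "C", 2.0);
        Map<String, Double> vector0 = Map.of("A", 0.9, "B", 0.8, "C", 0.7);
        Map<String, Double> keyword1 = Map.of("D", 1.0);
        Map<String, Double> vector1 = Map.of("D", 0.1);
        return List.of(List.of(keyword0, vector0), List.of(keyword1, vector1));
    }

    public static void main(String[] args) {
        List<List<Map<String, Double>>> shards = sampleShards();
        System.out.println("aggregate, then RRF: " + aggregateThenFuse(shards, 2));
        System.out.println("RRF per shard, then aggregate: " + fuseThenAggregate(shards));
    }
}
```

With this sample data, aggregating first yields [A, B, C, D], while fusing per shard first yields [A, D, B, C]. D is the weakest hit globally, but it is rank 1 on its own shard, so its local RRF score ties with A's.
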
>>> 
>>> Cheers
>>> --------------------------
>>> Alessandro Benedetti
>>> Director @ Sease Ltd.
>>> Apache Lucene/Solr Committer
>>> Apache Solr PMC Member
>>> 
>>> e-mail: a.benede...@sease.io <mailto:a.benede...@sease.io>
>>> 
>>> 
>>> Sease - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>> 
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter 
>>> <https://twitter.com/seaseltd> | Youtube 
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github 
>>> <https://github.com/seaseltd>
>>> 
>>> On Mon, 13 May 2024 at 14:12, Adrien Grand <jpou...@gmail.com 
>>> <mailto:jpou...@gmail.com>> wrote:
>>>> > Maybe Adrien Grand and others might also have some feedback :-)
>>>> 
>>>> I'd suggest a signature along the lines of `TopDocs TopDocs#rrf(int 
>>>> topN, int k, TopDocs[] hits)` to be consistent with `TopDocs#merge`. 
>>>> Internally, it should look at `ScoreDoc#shardIndex` and `ScoreDoc#doc` to 
>>>> figure out which hits map to the same document.
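
A minimal sketch of the fusion step described here, in plain Java with no Lucene dependency. The `Hit` class is a hypothetical stand-in for `ScoreDoc`, the `List`-based signature only approximates the suggested `rrf(int topN, int k, TopDocs[] hits)`, and the score assumed is the usual RRF sum of `1 / (k + rank)` over every list that contains the document, with 1-based ranks. It is an illustration, not the eventual Lucene implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RrfSketch {
    /** Hypothetical stand-in for Lucene's ScoreDoc: shard index, doc ID, score. */
    static final class Hit {
        final int shardIndex;
        final int doc;
        final double score;

        Hit(int shardIndex, int doc, double score) {
            this.shardIndex = shardIndex;
            this.doc = doc;
            this.score = score;
        }

        // Hits with the same (shardIndex, doc) pair refer to the same document.
        long key() {
            return ((long) shardIndex << 32) | (doc & 0xFFFFFFFFL);
        }
    }

    /** Fuses ranked lists with RRF and returns the topN hits by fused score. */
    static List<Hit> rrf(int topN, int k, List<List<Hit>> rankedLists) {
        Map<Long, Double> fusedScores = new HashMap<>();
        Map<Long, Hit> firstSeen = new HashMap<>();
        for (List<Hit> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                Hit hit = list.get(i);
                // The position in the list is the rank; the original score is ignored.
                fusedScores.merge(hit.key(), 1.0 / (k + i + 1), Double::sum);
                firstSeen.putIfAbsent(hit.key(), hit);
            }
        }
        List<Hit> fused = new ArrayList<>();
        for (Map.Entry<Long, Double> e : fusedScores.entrySet()) {
            Hit h = firstSeen.get(e.getKey());
            fused.add(new Hit(h.shardIndex, h.doc, e.getValue()));
        }
        // Highest fused score first; ties broken by shard index, then doc ID.
        fused.sort((a, b) -> {
            int c = Double.compare(b.score, a.score);
            if (c != 0) return c;
            c = Integer.compare(a.shardIndex, b.shardIndex);
            return c != 0 ? c : Integer.compare(a.doc, b.doc);
        });
        return new ArrayList<>(fused.subList(0, Math.min(topN, fused.size())));
    }

    public static void main(String[] args) {
        // Doc 1 on shard 0 and doc 1 on shard 1 are different documents.
        List<Hit> keyword = List.of(new Hit(0, 1, 7.2), new Hit(0, 2, 5.1), new Hit(1, 1, 3.3));
        List<Hit> vector = List.of(new Hit(1, 1, 0.93), new Hit(0, 1, 0.90), new Hit(0, 3, 0.4));
        for (Hit h : rrf(3, 60, List.of(keyword, vector))) {
            System.out.println("shard=" + h.shardIndex + " doc=" + h.doc + " rrf=" + h.score);
        }
    }
}
```

Keying on the shard index and doc ID together keeps hits with the same doc ID on different shards apart, which is the point of looking at both fields.
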
>>>> 
>>>> > Back in the day, I was reasoning on this and I didn't think Lucene was 
>>>> > the right place for an interleaving algorithm, given that Reciprocal 
>>>> > Rank Fusion is affected by distribution and it's not supposed to work 
>>>> > per node.
>>>> 
>>>> To me this is like `TopDocs#merge`. There are changes needed on the 
>>>> application side to hook this call into the logic that combines hits that 
>>>> come from multiple shards (multiple queries in the case of RRF), but 
>>>> Lucene can still provide the merging logic.
>>>> 
>>>> On Mon, May 13, 2024 at 1:41 PM Michael Wechner <michael.wech...@wyona.com 
>>>> <mailto:michael.wech...@wyona.com>> wrote:
>>>>> Thanks for your feedback Alessandro!
>>>>> 
>>>>> I am using Lucene independently of Solr, OpenSearch, or Elasticsearch, but 
>>>>> would like to combine different result sets using RRF, so I think that 
>>>>> Lucene itself could actually be a good place for it.
>>>>> 
>>>>> Looking forward to your additional elaboration!
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Michael
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 13.05.2024 at 12:34, Alessandro Benedetti 
>>>>>> <a.benede...@sease.io <mailto:a.benede...@sease.io>> wrote:
>>>>>> 
>>>>>> This is not strictly related to Lucene, but I'll give a talk at Berlin 
>>>>>> Buzzwords on how I am implementing Reciprocal Rank Fusion in Apache Solr.
>>>>>> I'll resume my work on the contribution next week and have more to share 
>>>>>> later.
>>>>>> 
>>>>>> Back in the day, I was reasoning on this and I didn't think Lucene was 
>>>>>> the right place for an interleaving algorithm, given that Reciprocal 
>>>>>> Rank Fusion is affected by distribution and it's not supposed to work 
>>>>>> per node.
>>>>>> I think I evaluated the possibility of doing it as a Lucene query or a 
>>>>>> Lucene component but then ended up with a different approach.
>>>>>> I'll elaborate more when I go back to the task!
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> On Sat, 11 May 2024 at 09:10, Michael Wechner <michael.wech...@wyona.com 
>>>>>> <mailto:michael.wech...@wyona.com>> wrote:
>>>>>>> sure, no problem!
>>>>>>> 
>>>>>>> Maybe Adrien Grand and others might also have some feedback :-)
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> Michael
>>>>>>> 
>>>>>>> On 10.05.24 at 23:03, Chang Hank wrote:
>>>>>>>> Thank you for these useful resources. Please allow me to spend some 
>>>>>>>> time looking into them. 
>>>>>>>> I’ll let you know asap!!
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> Hank
>>>>>>>> 
>>>>>>>>> On May 10, 2024, at 12:34 PM, Michael Wechner 
>>>>>>>>> <michael.wech...@wyona.com> <mailto:michael.wech...@wyona.com> wrote:
>>>>>>>>> 
>>>>>>>>> also we might want to consider how this relates to
>>>>>>>>> 
>>>>>>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html
>>>>>>>>> 
>>>>>>>>> In vector search, reranking has become quite popular, e.g.
>>>>>>>>> 
>>>>>>>>> https://docs.cohere.com/docs/reranking
>>>>>>>>> 
>>>>>>>>> IIUC LangChain (python) for example adds the reranker as an argument 
>>>>>>>>> to the searcher/retriever
>>>>>>>>> 
>>>>>>>>> https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/
>>>>>>>>> 
>>>>>>>>> So maybe the following might make sense as well
>>>>>>>>> 
>>>>>>>>> // Proposed API sketch: CohereReranker and RRFRanker are hypothetical classes
>>>>>>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>>>>>>>>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50, new CohereReranker());
>>>>>>>>> 
>>>>>>>>> TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, topDocsVector);
>>>>>>>>> 
>>>>>>>>> WDYT?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> Michael
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 10.05.24 at 21:08, Michael Wechner wrote:
>>>>>>>>>> great, yes, let's get started :-)
>>>>>>>>>> 
>>>>>>>>>> What about the following pseudocode, assuming that there might be 
>>>>>>>>>> alternative ranking algorithms to RRF?
>>>>>>>>>> 
>>>>>>>>>> // Ranker, RRFRanker and TopDocs.rank are proposed names, not existing Lucene API
>>>>>>>>>> StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
>>>>>>>>>> StoredFields storedFieldsVector = indexReaderVector.storedFields();
>>>>>>>>>> 
>>>>>>>>>> TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
>>>>>>>>>> TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);
>>>>>>>>>> 
>>>>>>>>>> Ranker ranker = new RRFRanker();
>>>>>>>>>> TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);
>>>>>>>>>> 
>>>>>>>>>> for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
>>>>>>>>>>     Document docK = storedFieldsKeyword.document(scoreDoc.doc);
>>>>>>>>>>     Document docV = storedFieldsVector.document(scoreDoc.doc);
>>>>>>>>>>     ....
>>>>>>>>>> } 
>>>>>>>>>> 
>>>>>>>>>> See also:
>>>>>>>>>> 
>>>>>>>>>> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
>>>>>>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html
>>>>>>>>>> 
>>>>>>>>>> WDYT?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> 
>>>>>>>>>> Michael
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 10.05.24 at 20:01, Chang Hank wrote:
>>>>>>>>>>> Hi Michael,
>>>>>>>>>>> 
>>>>>>>>>>> Sounds good to me. 
>>>>>>>>>>> Let’s do it!!
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Hank
>>>>>>>>>>> 
>>>>>>>>>>>> On May 10, 2024, at 10:50 AM, Michael Wechner 
>>>>>>>>>>>> <michael.wech...@wyona.com> <mailto:michael.wech...@wyona.com> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Hank
>>>>>>>>>>>> 
>>>>>>>>>>>> Very cool!
>>>>>>>>>>>> 
>>>>>>>>>>>> Adrien Grand suggested implementing it as a utility method on the 
>>>>>>>>>>>> TopDocs class, and since Adrien has worked on Lucene for a decade
>>>>>>>>>>>> https://www.elastic.co/de/blog/author/adrien-grand
>>>>>>>>>>>> I guess it makes sense to follow his advice :-)
>>>>>>>>>>>> 
>>>>>>>>>>>> We could create a PR and work together on it, WDYT?
>>>>>>>>>>>> 
>>>>>>>>>>>> All the best
>>>>>>>>>>>> 
>>>>>>>>>>>> Michael
>>>>>>>>>>>> 
>>>>>>>>>>>> On 10.05.24 at 18:51, Chang Hank wrote:
>>>>>>>>>>>>> Hi Michael, 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you for the reply.
>>>>>>>>>>>>> This is a really cool issue to work on, and I’m happy to work on 
>>>>>>>>>>>>> this with you. I’ll try to do some research on RRF first.
>>>>>>>>>>>>> Also, are we going to implement this on the TopDocs class?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Hank
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On May 9, 2024, at 11:08 PM, Michael Wechner 
>>>>>>>>>>>>>> <michael.wech...@wyona.com> <mailto:michael.wech...@wyona.com> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Hank
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for offering your help!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I recently suggested implementing RRF (Reciprocal Rank Fusion)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> but still have not found the time to really work on this.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Maybe you would be interested in doing this, or we could work on 
>>>>>>>>>>>>>> it together somehow?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Michael
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 10.05.24 at 07:27, Chang Hank wrote:
>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I’m Hank Chang, currently studying Information Retrieval 
>>>>>>>>>>>>>>> topics. I’m really interested in contributing to Apache Lucene 
>>>>>>>>>>>>>>> and enhancing my understanding of the field.
>>>>>>>>>>>>>>> I’ve reviewed several issues posted on the GitHub repository 
>>>>>>>>>>>>>>> but haven’t found a straightforward starting point. Could 
>>>>>>>>>>>>>>> someone please recommend suitable issues for a newcomer like me 
>>>>>>>>>>>>>>> or suggest areas I could assist with?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thank you for your time and guidance.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Hank Chang
>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org 
>>>>>>>>>>>>>>> <mailto:dev-unsubscr...@lucene.apache.org>
>>>>>>>>>>>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org 
>>>>>>>>>>>>>>> <mailto:dev-h...@lucene.apache.org>
>>>>>>>>>>>>>>> 
>>>> --
>>>> Adrien
>> 
>
