Re: Any recommended issues to work on for a newcomer?

Michael Wechner Sat, 22 Jun 2024 15:21:28 -0700

Hi Hank

Sorry, I still have not found the time to try your code, but today I learned about

https://rockset.com/Rockset_for_Hybrid_Search.pdf
https://rockset.com/whitepapers/hybrid-search-architecture/

which might be interesting to compare with.

Thanks

Michael



On 20.05.24 at 08:16, Michael Wechner wrote:

Hi Hank

Very cool, thank you, will try to do this asap!

All the best

Michael


On 19.05.24 at 01:42, Chang Hank wrote:

Hey Michael,

I wrote the first version of my idea for implementing RRF in Lucene; here is the link to the code: https://gist.github.com/hack4chang/ee2b37eab80bd82e574ff4f94ed204e9. Right now I have some questions: one is about the shardIndex to be returned, another is about the TotalHits value. Please take a look at the code and kindly leave some comments below.


Thanks,
Hank

On May 18, 2024, at 2:01 PM, Chang Hank <[email protected]> wrote:

Or maybe we can first create an issue and PR based on the issue number?
WDYT?

Best,

Hank

On May 18, 2024, at 11:29 AM, Chang Hank <[email protected]> wrote:


Hey Michael,

Sorry, I was a bit busy this week, but I’ve looked into the resources you provided and also some useful advice from Alessandro and Adrien.

I have a brief understanding of how RRF works, but I’m not quite sure how we should implement it. Based on the advice from Alessandro and Adrien, it seems we need to consider that the search results are located on different shards. According to Alessandro, we should aggregate the ranked lists from all distributed nodes and then apply RRF. Are we going to implement this aggregation logic inside our RRF method?

Also, could you please create a PR so we can discuss the details further?


All the best,

Hank

On May 13, 2024, at 10:09 AM, Michael Wechner <[email protected]> wrote:


Great, sounds like we have a plan :-)

Hank and I can get started trying to understand the internals better ...


Thanks

Michael

On 13.05.24 at 18:21, Alessandro Benedetti wrote:

Sure, we can make it work, but in a distributed environment you first have to run each query distributed (aggregating all nodes) and then apply RRF on top of the aggregated ranked lists. Doing RRF per node first and then aggregating per shard won't return the same results, I suspect.
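
The difference between the two orders of operation can be checked on a toy example. The sketch below is plain Java (16+), not Lucene or Solr code: the `Hit` record, the helper methods, the two shards, their scores, and `k = 60` are all made up for illustration. It assumes ranks start at 1 and that each list contributes `1 / (k + rank)` to a document's fused score.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (Java 16+). Hit and every method here are hypothetical helpers,
// not Lucene or Solr API.
class RrfOrderSketch {
    record Hit(String id, double score) {}

    // RRF: each list adds 1 / (k + rank) to a document's fused score, ranks starting at 1.
    static Map<String, Double> rrf(int k, List<List<Hit>> rankedLists) {
        Map<String, Double> fused = new HashMap<>();
        for (List<Hit> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                fused.merge(list.get(i).id(), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return fused;
    }

    // Concatenates hit lists and sorts them best score first.
    static List<Hit> mergeByScore(List<List<Hit>> lists) {
        List<Hit> all = new ArrayList<>();
        lists.forEach(all::addAll);
        all.sort((x, y) -> Double.compare(y.score(), x.score()));
        return all;
    }

    // Turns a map of fused scores back into hits so they can be merged by score.
    static List<Hit> toHits(Map<String, Double> fused) {
        List<Hit> hits = new ArrayList<>();
        fused.forEach((id, score) -> hits.add(new Hit(id, score)));
        return hits;
    }

    // Made-up per-shard results for two queries; each list is already sorted by score.
    static final List<Hit> KEYWORD_A = List.of(new Hit("a1", 9.0), new Hit("a2", 8.0));
    static final List<Hit> KEYWORD_B = List.of(new Hit("b1", 1.0));
    static final List<Hit> VECTOR_A = List.of(new Hit("a2", 0.9), new Hit("a1", 0.8));
    static final List<Hit> VECTOR_B = List.of(new Hit("b1", 0.1));

    // Aggregate each query across shards first, then fuse the two global lists.
    static List<Hit> aggregateThenFuse(int k) {
        List<Hit> keyword = mergeByScore(List.of(KEYWORD_A, KEYWORD_B));
        List<Hit> vector = mergeByScore(List.of(VECTOR_A, VECTOR_B));
        return mergeByScore(List.of(toHits(rrf(k, List.of(keyword, vector)))));
    }

    // Fuse on each shard first, then merge the shards by fused score.
    static List<Hit> fuseThenAggregate(int k) {
        List<Hit> shardA = toHits(rrf(k, List.of(KEYWORD_A, VECTOR_A)));
        List<Hit> shardB = toHits(rrf(k, List.of(KEYWORD_B, VECTOR_B)));
        return mergeByScore(List.of(shardA, shardB));
    }

    public static void main(String[] args) {
        System.out.println("aggregate, then RRF: " + aggregateThenFuse(60));
        System.out.println("RRF per shard, then aggregate: " + fuseThenAggregate(60));
    }
}
```

On this data, aggregating first ranks `b1` last, while fusing per shard ranks it first: `b1` is rank 1 in both lists on its own shard even though its scores are the lowest overall.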

When I go back to working on the task I'll be able to elaborate more!

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: [email protected]

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>

LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>



On Mon, 13 May 2024 at 14:12, Adrien Grand <[email protected]> wrote:

    > Maybe Adrien Grand and others might also have some feedback
    :-)

    I'd suggest the signature to look something like `TopDocs
    TopDocs#rrf(int topN, int k, TopDocs[] hits)` to be
    consistent with `TopDocs#merge`. Internally, it should look
    at `ScoreDoc#shardIndex` and `ScoreDoc#doc` to figure out which
    hits map to the same document.
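
A minimal sketch of what such a method might do, written against plain Java stand-ins rather than Lucene classes so that it runs on its own. `Hit`, `DocKey`, and this `rrf` are hypothetical; only the idea of keying on shard index plus doc id and the parameter order `topN, k, hits` come from the suggestion above. It assumes ranks start at 1 and that each list contributes `1 / (k + rank)`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (Java 16+) with plain stand-ins instead of Lucene classes:
// Hit mimics the shardIndex, doc and score fields of ScoreDoc; rrf is hypothetical.
class TopDocsRrfSketch {
    record Hit(int shardIndex, int doc, float score) {}
    record DocKey(int shardIndex, int doc) {}

    // Fuses ranked lists with RRF and returns at most topN hits, best fused score first.
    static List<Hit> rrf(int topN, int k, List<List<Hit>> rankedLists) {
        Map<DocKey, Float> fused = new HashMap<>();
        for (List<Hit> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                Hit hit = list.get(i);
                // Hits with the same shardIndex and doc are the same document.
                DocKey key = new DocKey(hit.shardIndex(), hit.doc());
                fused.merge(key, 1f / (k + i + 1), Float::sum);
            }
        }
        List<Hit> result = new ArrayList<>();
        fused.forEach((key, score) -> result.add(new Hit(key.shardIndex(), key.doc(), score)));
        // Sort by fused score, then by shardIndex and doc so ties are deterministic.
        result.sort((x, y) -> {
            int byScore = Float.compare(y.score(), x.score());
            if (byScore != 0) return byScore;
            int byShard = Integer.compare(x.shardIndex(), y.shardIndex());
            return byShard != 0 ? byShard : Integer.compare(x.doc(), y.doc());
        });
        return result.subList(0, Math.min(topN, result.size()));
    }

    public static void main(String[] args) {
        // Made-up hits: (shardIndex, doc, original score), each list sorted by score.
        List<Hit> keyword = List.of(new Hit(0, 7, 3.2f), new Hit(1, 7, 2.9f), new Hit(0, 4, 1.1f));
        List<Hit> vector = List.of(new Hit(1, 7, 0.93f), new Hit(0, 4, 0.90f));
        rrf(2, 60, List.of(keyword, vector)).forEach(System.out::println);
    }
}
```

Here doc 7 on shard 0 and doc 7 on shard 1 are treated as different documents, so only the shard 1 hit collects a contribution from both lists and ends up first.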

    > Back in the day, I was reasoning on this and I didn't think
    Lucene was the right place for an interleaving algorithm,
    given that Reciprocal Rank Fusion is affected by distribution
    and it's not supposed to work per node.

    To me this is like `TopDocs#merge`. There are changes needed
    on the application side to hook this call into the logic that
    combines hits that come from multiple shards (multiple
    queries in the case of RRF), but Lucene can still provide the
    merging logic.

    On Mon, May 13, 2024 at 1:41 PM Michael Wechner
    <[email protected]> wrote:

        Thanks for your feedback Alessandro!

        I am using Lucene independently of Solr, OpenSearch, or
        Elasticsearch, but would like to combine different result
        sets using RRF, and therefore think that Lucene itself
        could actually be a good place for it.

        Looking forward to your additional elaboration!

        Thanks

        Michael

        On 13.05.2024 at 12:34, Alessandro Benedetti
        <[email protected]> wrote:

        This is not strictly related to Lucene, but I'll give a
        talk at Berlin Buzzwords on how I am implementing
        Reciprocal Rank Fusion in Apache Solr.
        I'll resume my work on the contribution next week and
        have more to share later.

        Back in the day, I was reasoning on this and I didn't
        think Lucene was the right place for an interleaving
        algorithm, given that Reciprocal Rank Fusion is affected
        by distribution and it's not supposed to work per node.
        I think I evaluated the possibility of doing it as a
        Lucene query or a Lucene component but then ended up
        with a different approach.
        I'll elaborate more when I go back to the task!

        Cheers


        On Sat, 11 May 2024 at 09:10, Michael Wechner
        <[email protected]> wrote:

            sure, no problem!

            Maybe Adrien Grand and others might also have some
            feedback :-)

            Thanks

            Michael

            On 10.05.24 at 23:03, Chang Hank wrote:

            Thank you for these useful resources, please allow
            me to spend some time looking into them.
            I’ll let you know asap!!

            Thanks

            Hank

            On May 10, 2024, at 12:34 PM, Michael Wechner
            <[email protected]> wrote:

            Also, we might want to consider how this relates to

            https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html

            In vector search, reranking has become quite
            popular, e.g.

            https://docs.cohere.com/docs/reranking

            IIUC, LangChain (Python), for example, adds the
            reranker as an argument to the searcher/retriever

            https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/

            So maybe the following might make sense as well

            TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
            TopDocs topDocsVector = vectorSearcher.search(query, 50, new CohereReranker());

            TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, topDocsVector);

            WDYT?

            Thanks

            Michael


            On 10.05.24 at 21:08, Michael Wechner wrote:

            great, yes, let's get started :-)

            What about the following pseudo code, assuming
            that there might be alternative ranking
            algorithms to RRF:

            StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
            StoredFields storedFieldsVector = indexReaderVector.storedFields();

            TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
            TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);

            // Ranker, RRFRanker and TopDocs.rank are proposed names, not existing Lucene API
            Ranker ranker = new RRFRanker();
            TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);

            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                Document docK = storedFieldsKeyword.document(scoreDoc.doc);
                Document docV = storedFieldsVector.document(scoreDoc.doc);
                ....
            }

            See also

            https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
            https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html

            WDYT?

            Thanks

            Michael




            On 10.05.24 at 20:01, Chang Hank wrote:

            Hi Michael,

            Sounds good to me.
            Let’s do it!!

            Cheers,
            Hank

            On May 10, 2024, at 10:50 AM, Michael Wechner
            <[email protected]> wrote:

            Hi Hank

            Very cool!

            Adrien Grand suggested implementing it as a
            utility method on the TopDocs class, and since
            Adrien has worked on Lucene for a decade
            https://www.elastic.co/de/blog/author/adrien-grand
            I guess it makes sense to follow his advice :-)

            We could create a PR and work together on it,
            WDYT?

            All the best

            Michael

            On 10.05.24 at 18:51, Chang Hank wrote:

            Hi Michael,

            Thank you for the reply.
            This is really a cool issue to work on, I’m
            happy to work on this with you. I’ll try to do
            research on RRF first.
            Also, are we going to implement this on the
            TopDocs class?

            Best,
            Hank

            On May 9, 2024, at 11:08 PM, Michael Wechner
            <[email protected]> wrote:

            Hi Hank

            Thanks for offering your help!

            I recently suggested implementing RRF
            (Reciprocal Rank Fusion)

            https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz

            but still have not found the time to really
            work on this.

            Maybe you would be interested in doing this, or
            we could work on it together somehow?

            Thanks

            Michael



            On 10.05.24 at 07:27, Chang Hank wrote:

            Hi everyone,

            I’m Hank Chang, currently studying
            Information Retrieval topics. I’m really
            interested in contributing to Apache Lucene
            and enhancing my understanding of the field.
            I’ve reviewed several issues posted on the
            GitHub repository but haven’t found a
            straightforward starting point. Could
            someone please recommend suitable issues for
            a newcomer like me or suggest areas I could
            assist with?

            Thank you for your time and guidance.

            Best regards,
            Hank Chang
            
---------------------------------------------------------------------
            To unsubscribe, e-mail:
            [email protected]
            For additional commands, e-mail:
            [email protected]



            

--Adrien
