Re: Any recommended issues to work on for a newcomer?

Michael Wechner Fri, 31 May 2024 14:30:30 -0700

Thank you very much for sharing!

Unfortunately, I have not yet found time to review Hank's work, but maybe Hank can already proceed based on your code.


Thanks

Michael

On 31.05.24 at 18:50, Alessandro Benedetti wrote:

Just for your curiosity, my Reciprocal Rank Fusion contribution to Solr is in decent shape now:

https://github.com/apache/solr/pull/2489

Everything is on the Solr side only, but maybe it can serve as inspiration if you want to do similar work in Lucene.


Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: [email protected]

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>

LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>

On Mon, 20 May 2024 at 08:16, Michael Wechner <[email protected]> wrote:


    Hi Hank

    Very cool, thank you, will try to do this asap!

    All the best

    Michael


    On 19.05.24 at 01:42, Chang Hank wrote:

    Hey Michael,

    I wrote a first version of my idea for implementing RRF in
    Lucene; here is the link to the code:
    https://gist.github.com/hack4chang/ee2b37eab80bd82e574ff4f94ed204e9
    Right now I have some questions: one is about the shardIndex to
    be returned, the other is about the TotalHits value. Please take
    a look at the code and kindly leave some comments below.

    Thanks,
    Hank

    On May 18, 2024, at 2:01 PM, Chang Hank <[email protected]> wrote:

    Or maybe we can first create an issue and then a PR based on the
    issue number?
    WDYT?

    Best,

    Hank

    On May 18, 2024, at 11:29 AM, Chang Hank <[email protected]> wrote:

    Hey Michael,

    Sorry I was a bit busy this week, but I’ve looked into the
    resources you provided and also some useful advice from
    Alessandro and Adrien.

    I have a brief understanding of how RRF works, but I’m not
    quite sure how we should implement it. Based on the advice from
    Alessandro and Adrien, it seems we need to consider that the
    search results are located on different shards. According to
    Alessandro, we should aggregate the ranked lists from all
    distributed nodes and then apply RRF.
    Are we going to implement this aggregation logic inside our RRF
    method?

    Also, could you please create a PR so we can discuss the
    details further?

    All the best,

    Hank

    On May 13, 2024, at 10:09 AM, Michael Wechner <[email protected]> wrote:

    Great, sounds like we have a plan :-)

    Hank and I can get started trying to understand the internals
    better ...

    Thanks

    Michael

    On 13.05.24 at 18:21, Alessandro Benedetti wrote:

    Sure, we can make it work, but in a distributed environment
    you first have to run each query across all nodes and aggregate
    the results, and then apply RRF on top of the aggregated ranked
    lists. Doing RRF per node first and then aggregating across
    shards won't return the same results, I suspect.
    When I go back to working on the task I'll be able to
    elaborate more!
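
The point about ordering can be illustrated with a toy example. This is only a sketch under stated assumptions: it uses no Lucene or Solr classes, the class and method names (`RrfShardOrderDemo`, `aggregateThenRrf`, `rrfPerShardThenMerge`) are made up for illustration, document ids are assumed unique across shards, and k = 60 is just a commonly used constant. It requires Java 16+ (records, `Stream#toList`).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RrfShardOrderDemo {
    record Scored(String id, double score) {}

    // Sorts by score, best first; ties are broken by id so the output is deterministic.
    static List<Scored> sortBestFirst(Map<String, Double> scores) {
        List<Scored> out = new ArrayList<>();
        scores.forEach((id, s) -> out.add(new Scored(id, s)));
        out.sort((a, b) -> {
            int c = Double.compare(b.score(), a.score());
            return c != 0 ? c : a.id().compareTo(b.id());
        });
        return out;
    }

    static List<String> ids(List<Scored> docs) {
        return docs.stream().map(Scored::id).toList();
    }

    // RRF: every list adds 1 / (k + rank) to a document, rank being 1-based.
    static List<Scored> rrf(int k, List<List<String>> rankedLists) {
        Map<String, Double> fused = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                fused.merge(list.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return sortBestFirst(fused);
    }

    // Variant 1: merge each query's hits across shards by raw score, then fuse.
    static List<String> aggregateThenRrf(int k, List<List<Map<String, Double>>> queries) {
        List<List<String>> globalLists = new ArrayList<>();
        for (List<Map<String, Double>> byShard : queries) {
            Map<String, Double> all = new HashMap<>();
            byShard.forEach(all::putAll);
            globalLists.add(ids(sortBestFirst(all)));
        }
        return ids(rrf(k, globalLists));
    }

    // Variant 2: fuse on every shard separately, then merge shards by fused score.
    static List<String> rrfPerShardThenMerge(int k, List<List<Map<String, Double>>> queries) {
        Map<String, Double> all = new HashMap<>();
        int shards = queries.get(0).size();
        for (int s = 0; s < shards; s++) {
            List<List<String>> local = new ArrayList<>();
            for (List<Map<String, Double>> byShard : queries) {
                local.add(ids(sortBestFirst(byShard.get(s))));
            }
            for (Scored d : rrf(k, local)) {
                all.put(d.id(), d.score());
            }
        }
        return ids(sortBestFirst(all));
    }

    public static void main(String[] args) {
        // queries.get(q).get(s) holds the raw scores of query q on shard s.
        // Shard 0 holds A and B, shard 1 holds only C.
        List<Map<String, Double>> keyword = List.of(Map.of("A", 10.0, "B", 9.0), Map.of("C", 1.0));
        List<Map<String, Double>> vector = List.of(Map.of("B", 0.9, "A", 0.8), Map.of("C", 0.1));
        List<List<Map<String, Double>>> queries = List.of(keyword, vector);
        System.out.println(aggregateThenRrf(60, queries));     // [A, B, C]
        System.out.println(rrfPerShardThenMerge(60, queries)); // [C, A, B]
    }
}
```

In this toy data, C is the weakest hit for both queries, yet it ranks first on its own shard for both, so fusing per shard pushes it to the top. One example does not prove the general case, but it shows how the two orders of operation can diverge.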

    Cheers


    On Mon, 13 May 2024 at 14:12, Adrien Grand
    <[email protected]> wrote:

        > Maybe Adrien Grand and others might also have some
        feedback :-)

        I'd suggest that the signature look something like `TopDocs
        TopDocs#rrf(int topN, int k, TopDocs[] hits)` to be
        consistent with `TopDocs#merge`. Internally, it should
        look at `ScoreDoc#shardIndex` and `ScoreDoc#doc` to figure
        out which hits map to the same document.
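
A minimal, self-contained Java sketch of the kind of merge described here. It deliberately does not use Lucene classes: `Hit` and `DocKey` are hypothetical stand-ins for `ScoreDoc` and for the (shard, doc) identity, `RrfSketch` is an illustrative name, and `TopDocs#rrf` itself is only a proposal in this thread. Each hit contributes 1 / (k + rank) with a 1-based rank; k = 60 in the usage is a commonly used constant, chosen here as an assumption. Requires Java 16+ for records.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RrfSketch {
    // Stand-in for Lucene's ScoreDoc: only the fields needed for fusion.
    record Hit(int shardIndex, int doc, float score) {}

    // Identity of a document across result lists: shard plus per-shard doc id.
    record DocKey(int shardIndex, int doc) {}

    /**
     * Fuses ranked lists with Reciprocal Rank Fusion.
     * Each list must already be sorted best-first. The input scores are ignored;
     * only the 1-based rank of a hit within its list matters.
     */
    static List<Hit> rrf(int topN, int k, List<List<Hit>> rankedLists) {
        Map<DocKey, Double> fused = new HashMap<>();
        for (List<Hit> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                Hit h = list.get(i);
                fused.merge(new DocKey(h.shardIndex(), h.doc()), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<Map.Entry<DocKey, Double>> entries = new ArrayList<>(fused.entrySet());
        // Highest fused score first; ties broken by shard, then doc id, for determinism.
        entries.sort((a, b) -> {
            int c = Double.compare(b.getValue(), a.getValue());
            if (c != 0) return c;
            c = Integer.compare(a.getKey().shardIndex(), b.getKey().shardIndex());
            return c != 0 ? c : Integer.compare(a.getKey().doc(), b.getKey().doc());
        });
        List<Hit> out = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, entries.size()); i++) {
            Map.Entry<DocKey, Double> e = entries.get(i);
            out.add(new Hit(e.getKey().shardIndex(), e.getKey().doc(), e.getValue().floatValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two result lists for the same index, e.g. one keyword and one vector query.
        List<Hit> keyword = List.of(new Hit(0, 1, 3.0f), new Hit(0, 2, 2.0f), new Hit(1, 1, 1.0f));
        List<Hit> vector = List.of(new Hit(1, 1, 0.9f), new Hit(0, 1, 0.8f));
        // Fused order: (shard 0, doc 1), (shard 1, doc 1), (shard 0, doc 2).
        for (Hit h : rrf(3, 60, List.of(keyword, vector))) {
            System.out.println(h);
        }
    }
}
```

Keying on both the shard and the doc id matters because doc ids are only unique within a single index: doc 1 on shard 0 and doc 1 on shard 1 are different documents and must not be fused.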

        > Back in the day, I was reasoning on this and I didn't
        think Lucene was the right place for an interleaving
        algorithm, given that Reciprocal Rank Fusion is affected
        by distribution and it's not supposed to work per node.

        To me this is like `TopDocs#merge`. There are changes
        needed on the application side to hook this call into the
        logic that combines hits that come from multiple shards
        (multiple queries in the case of RRF), but Lucene can
        still provide the merging logic.

        On Mon, May 13, 2024 at 1:41 PM Michael Wechner
        <[email protected]> wrote:

            Thanks for your feedback Alessandro!

            I am using Lucene independently of Solr, OpenSearch, or
            Elasticsearch, but would like to combine different
            result sets using RRF, and therefore think that Lucene
            itself could actually be a good place for it.

            Looking forward to your additional elaboration!

            Thanks

            Michael

            On 13.05.2024 at 12:34, Alessandro Benedetti
            <[email protected]> wrote:

            This is not strictly related to Lucene, but I'll
            give a talk at Berlin Buzzwords on how I am
            implementing Reciprocal Rank Fusion in Apache Solr.
            I'll resume my work on the contribution next week
            and have more to share later.

            Back in the day, I was reasoning on this and I
            didn't think Lucene was the right place for an
            interleaving algorithm, given that Reciprocal Rank
            Fusion is affected by distribution and it's not
            supposed to work per node.
            I think I evaluated the possibility of doing it as a
            Lucene query or a Lucene component but then ended up
            with a different approach.
            I'll elaborate more when I go back to the task!

            Cheers


            On Sat, 11 May 2024 at 09:10, Michael Wechner
            <[email protected]> wrote:

                sure, no problem!

                Maybe Adrien Grand and others might also have
                some feedback :-)

                Thanks

                Michael

                On 10.05.24 at 23:03, Chang Hank wrote:

                Thank you for these useful resources, please
                allow me to spend some time looking into them.
                I’ll let you know asap!!

                Thanks

                Hank

                On May 10, 2024, at 12:34 PM, Michael Wechner
                <[email protected]> wrote:

                Also, we might want to consider how this relates to

                https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html

                In vector search, reranking has become quite
                popular, e.g.

                https://docs.cohere.com/docs/reranking

                IIUC LangChain (Python), for example, adds the
                reranker as an argument to the searcher/retriever

                https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/

                So maybe the following might make sense as well

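                // Proposed API only: CohereReranker, RRFRanker, and these
                // search/merge overloads do not exist in Lucene.
                TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
                TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50, new CohereReranker());

                TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, topDocsVector);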

                WDYT?

                Thanks

                Michael


                On 10.05.24 at 21:08, Michael Wechner wrote:

                great, yes, let's get started :-)

                What about the following pseudo code,
                assuming that there might be alternative
                ranking algorithms to RRF

                // Pseudo code: Ranker, RRFRanker, and TopDocs.rank are proposed
                // names and not part of Lucene.
                StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
                StoredFields storedFieldsVector = indexReaderVector.storedFields();

                TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
                TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);

                Ranker ranker = new RRFRanker();
                TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);

                for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                    Document docK = storedFieldsKeyword.document(scoreDoc.doc);
                    Document docV = storedFieldsVector.document(scoreDoc.doc);
                    ....
                }

                See also:

                https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
                https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html

                WDYT?

                Thanks

                Michael




                On 10.05.24 at 20:01, Chang Hank wrote:

                Hi Michael,

                Sounds good to me.
                Let’s do it!!

                Cheers,
                Hank

                On May 10, 2024, at 10:50 AM, Michael Wechner
                <[email protected]> wrote:

                Hi Hank

                Very cool!

                Adrien Grand suggested implementing it as a
                utility method on the TopDocs class, and
                since Adrien worked on Lucene for a decade
                https://www.elastic.co/de/blog/author/adrien-grand
                I guess it makes sense to follow his advice :-)

                We could create a PR and work together on it, WDYT?

                All the best

                Michael

                On 10.05.24 at 18:51, Chang Hank wrote:

                Hi Michael,

                Thank you for the reply.
                This is really a cool issue to work on,
                I’m happy to work on this with you. I’ll
                try to do research on RRF first.
                Also, are we going to implement this on
                the TopDocs class?

                Best,
                Hank

                On May 9, 2024, at 11:08 PM, Michael Wechner
                <[email protected]> wrote:

                Hi Hank

                Thanks for offering your help!

                I recently suggested implementing RRF
                (Reciprocal Rank Fusion)

                https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz

                but still have not found the time to
                really work on this.

                Maybe you would be interested in doing this,
                or we could work on it together somehow?

                Thanks

                Michael



                On 10.05.24 at 07:27, Chang Hank wrote:

                Hi everyone,

                I’m Hank Chang, currently studying
                Information Retrieval topics. I’m really
                interested in contributing to Apache
                Lucene and enhancing my understanding of
                the field.
                I’ve reviewed several issues posted on
                the Github repository but haven’t found
                a straightforward starting point. Could
                someone please recommend suitable issues
                for a newcomer like me or suggest areas
                I could assist with?

                Thank you for your time and guidance.

                Best regards,
                Hank Chang
                
---------------------------------------------------------------------
                To unsubscribe, e-mail:
                [email protected]
                For additional commands, e-mail:
                [email protected]



                

--Adrien
