Re: Any recommended issues to work on for a newcomer?

Michael Wechner Mon, 13 May 2024 10:10:04 -0700

Great, sounds like we have a plan :-)

Hank and I can get started trying to understand the internals better ...


Thanks

Michael

On 13.05.24 at 18:21, Alessandro Benedetti wrote:

Sure, we can make it work, but in a distributed environment you have to run each query distributed first (aggregating all nodes) and then apply RRF on top of the aggregated ranked lists. Doing RRF per node first and then aggregating per shard won't return the same results, I suspect.
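
A small sketch of why the order of operations can matter, under made-up assumptions: documents a and b live on shard 0, document c lives on shard 1, and both queries rank them a, b, c once all nodes are aggregated. The `Hit` and `Fused` records, the method names, the tie-breaking rule and k = 60 are all hypothetical choices for illustration, not Lucene or Solr API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DistributedRrfDemo {
    // Hypothetical types for illustration only; not Lucene or Solr API.
    public record Hit(int shard, String id) {}
    public record Fused(Hit hit, double score) {}

    // Highest RRF score first; ties broken by shard, then id, to keep output deterministic.
    static final Comparator<Fused> BY_SCORE = (a, b) -> {
        int c = Double.compare(b.score(), a.score());
        if (c == 0) c = Integer.compare(a.hit().shard(), b.hit().shard());
        return c != 0 ? c : a.hit().id().compareTo(b.hit().id());
    };

    // RRF: each list contributes 1 / (k + rank) per hit, with ranks starting at 1.
    public static List<Fused> rrf(int k, List<List<Hit>> rankedLists) {
        Map<Hit, Double> scores = new HashMap<>();
        for (List<Hit> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<Fused> fused = new ArrayList<>();
        scores.forEach((hit, score) -> fused.add(new Fused(hit, score)));
        fused.sort(BY_SCORE);
        return fused;
    }

    // Per-node variant: every shard fuses its local rankings, then the fused lists are merged by score.
    public static List<Fused> perShardFirst(int k, int numShards, List<List<Hit>> globalLists) {
        List<Fused> merged = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            final int current = shard;
            List<List<Hit>> local = new ArrayList<>();
            for (List<Hit> ranked : globalLists) {
                // A shard only sees its own documents, in the same relative order.
                local.add(ranked.stream().filter(h -> h.shard() == current).toList());
            }
            merged.addAll(rrf(k, local));
        }
        merged.sort(BY_SCORE);
        return merged;
    }

    public static List<String> ids(List<Fused> fused) {
        return fused.stream().map(f -> f.hit().id()).toList();
    }

    public static void main(String[] args) {
        Hit a = new Hit(0, "a"), b = new Hit(0, "b"), c = new Hit(1, "c");
        // Both queries rank the documents a, b, c once all nodes are aggregated.
        List<List<Hit>> global = List.of(List.of(a, b, c), List.of(a, b, c));
        System.out.println("aggregate, then RRF: " + ids(rrf(60, global)));
        System.out.println("RRF per shard, then merge: " + ids(perShardFirst(60, 2, global)));
    }
}
```

With these inputs, aggregating first orders the documents a, b, c, while fusing per shard first orders them a, c, b: c is last for both queries globally, but it is first on its own shard, so its local ranks inflate its fused score.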

When I go back to working on the task I'll be able to elaborate more!

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>

LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>



On Mon, 13 May 2024 at 14:12, Adrien Grand <jpou...@gmail.com> wrote:

    > Maybe Adrien Grand and others might also have some feedback :-)

    I'd suggest that the signature look something like `TopDocs
    TopDocs#rrf(int topN, int k, TopDocs[] hits)` to be consistent
    with `TopDocs#merge`. Internally, it should look at
    `ScoreDoc#shardIndex` and `ScoreDoc#doc` to figure out which hits map
    to the same document.
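
A minimal sketch of the merging logic described here, in plain Java without a Lucene dependency. The `Hit` record is a hypothetical stand-in for the shard and doc fields of a `ScoreDoc`, `RrfSketch` and `Fused` are made-up names, and the tie-breaking rule is an assumption; this is not meant as the eventual `TopDocs#rrf` implementation. Each list contributes 1 / (k + rank) to a document's score, with ranks starting at 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfSketch {
    /** Hypothetical stand-in for a hit: a document is identified by its shard and doc id. */
    public record Hit(int shard, int doc) {}

    /** A fused result: the document key plus its accumulated RRF score. */
    public record Fused(Hit hit, double score) {}

    /**
     * Fuses several ranked lists with Reciprocal Rank Fusion and returns the topN results.
     * Hits with the same (shard, doc) pair are treated as the same document.
     */
    public static List<Fused> rrf(int topN, int k, List<List<Hit>> rankedLists) {
        Map<Hit, Double> scores = new HashMap<>();
        for (List<Hit> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                // Position i in the list corresponds to rank i + 1.
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<Fused> fused = new ArrayList<>();
        scores.forEach((hit, score) -> fused.add(new Fused(hit, score)));
        // Highest score first; ties broken by shard, then doc id (an assumed rule).
        fused.sort((a, b) -> {
            int c = Double.compare(b.score(), a.score());
            if (c == 0) c = Integer.compare(a.hit().shard(), b.hit().shard());
            return c != 0 ? c : Integer.compare(a.hit().doc(), b.hit().doc());
        });
        return new ArrayList<>(fused.subList(0, Math.min(topN, fused.size())));
    }

    public static void main(String[] args) {
        List<Hit> keyword = List.of(new Hit(0, 1), new Hit(0, 2), new Hit(0, 3));
        List<Hit> vector = List.of(new Hit(0, 2), new Hit(0, 4), new Hit(0, 1));
        for (Fused f : rrf(3, 60, List.of(keyword, vector))) {
            System.out.println(f.hit() + " " + f.score());
        }
    }
}
```

With k = 60 and these two lists, doc 2 scores 1/62 + 1/61 and comes first, ahead of doc 1 with 1/61 + 1/63, followed by doc 4 with 1/62.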

    > Back in the day, I was reasoning on this and I didn't think
    Lucene was the right place for an interleaving algorithm, given
    that Reciprocal Rank Fusion is affected by distribution and it's
    not supposed to work per node.

    To me this is like `TopDocs#merge`. There are changes needed on
    the application side to hook this call into the logic that
    combines hits that come from multiple shards (multiple queries in
    the case of RRF), but Lucene can still provide the merging logic.

    On Mon, May 13, 2024 at 1:41 PM Michael Wechner
    <michael.wech...@wyona.com> wrote:

        Thanks for your feedback Alessandro!

        I am using Lucene independently of Solr, OpenSearch, or
        Elasticsearch, but would like to combine different result sets
        using RRF, so I think Lucene itself could actually be a good
        place for it.

        Looking forward to your additional elaboration!

        Thanks

        Michael

        On 13.05.2024 at 12:34, Alessandro Benedetti
        <a.benede...@sease.io> wrote:

        This is not strictly related to Lucene, but I'll give a talk
        at Berlin Buzzwords on how I am implementing Reciprocal Rank
        Fusion in Apache Solr.
        I'll resume my work on the contribution next week and have
        more to share later.

        Back in the day, I was reasoning on this and I didn't think
        Lucene was the right place for an interleaving algorithm,
        given that Reciprocal Rank Fusion is affected by distribution
        and it's not supposed to work per node.
        I think I evaluated the possibility of doing it as a Lucene
        query or a Lucene component but then ended up with a
        different approach.
        I'll elaborate more when I go back to the task!

        Cheers


        On Sat, 11 May 2024 at 09:10, Michael Wechner
        <michael.wech...@wyona.com> wrote:

            sure, no problem!

            Maybe Adrien Grand and others might also have some
            feedback :-)

            Thanks

            Michael

            On 10.05.24 at 23:03, Chang Hank wrote:

            Thank you for these useful resources, please allow me to
            spend some time looking into them.
            I’ll let you know asap!!

            Thanks

            Hank

            On May 10, 2024, at 12:34 PM, Michael Wechner
            <michael.wech...@wyona.com> wrote:

            also we might want to consider how this relates to

            
            https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/Rescorer.html

            In vector search reranking has become quite popular, e.g.

            https://docs.cohere.com/docs/reranking

            IIUC LangChain (python) for example adds the reranker
            as an argument to the searcher/retriever

            
            https://python.langchain.com/v0.1/docs/integrations/retrievers/cohere-reranker/

            So maybe the following might make sense as well

            TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
            TopDocs topDocsVector = vectorSearcher.search(query, 50, new CohereReranker());

            TopDocs topDocs = TopDocs.merge(new RRFRanker(), topDocsKeyword, topDocsVector);

            WDYT?

            Thanks

            Michael


            On 10.05.24 at 21:08, Michael Wechner wrote:

            great, yes, let's get started :-)

            What about the following pseudo code, assuming that
            there might be alternative ranking algorithms to RRF

            // Pseudo code: Ranker, RRFRanker and TopDocs.rank are proposed names.
            StoredFields storedFieldsKeyword = indexReaderKeyword.storedFields();
            StoredFields storedFieldsVector = indexReaderVector.storedFields();

            // Run the keyword and the vector query separately.
            TopDocs topDocsKeyword = keywordSearcher.search(keywordQuery, 10);
            TopDocs topDocsVector = vectorSearcher.search(vectorQuery, 50);

            // Combine both result sets with a pluggable ranking algorithm.
            Ranker ranker = new RRFRanker();
            TopDocs topDocs = TopDocs.rank(ranker, topDocsKeyword, topDocsVector);

            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                Document docK = storedFieldsKeyword.document(scoreDoc.doc);
                Document docV = storedFieldsVector.document(scoreDoc.doc);
                ....
            }

            See also:

            https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/search/TopDocs.html
            https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html

            WDYT?

            Thanks

            Michael




            On 10.05.24 at 20:01, Chang Hank wrote:

            Hi Michael,

            Sounds good to me.
            Let’s do it!!

            Cheers,
            Hank

            On May 10, 2024, at 10:50 AM, Michael Wechner
            <michael.wech...@wyona.com> wrote:

            Hi Hank

            Very cool!

            Adrien Grand suggested implementing it as a utility
            method on the TopDocs class, and since Adrien worked
            for a decade on Lucene
            https://www.elastic.co/de/blog/author/adrien-grand I
            guess it makes sense to follow his advice :-)

            We could create a PR and work together on it, WDYT?

            All the best

            Michael

            On 10.05.24 at 18:51, Chang Hank wrote:

            Hi Michael,

            Thank you for the reply.
            This is really a cool issue to work on, I’m happy
            to work on this with you. I’ll try to do research
            on RRF first.
            Also, are we going to implement this on the TopDocs
            class?

            Best,
            Hank

            On May 9, 2024, at 11:08 PM, Michael Wechner
            <michael.wech...@wyona.com> wrote:

            Hi Hank

            Thanks for offering your help!

            I recently suggested to implement RRF (Reciprocal
            Rank Fusion)

            https://lists.apache.org/thread/vvwvjl0gk67okn8z1wg33ogyf9qm07sz

            but still have not found the time to really work
            on this.

            Maybe you would be interested in doing this, or we
            could work on it together somehow?

            Thanks

            Michael



            On 10.05.24 at 07:27, Chang Hank wrote:

            Hi everyone,

            I’m Hank Chang, currently studying Information
            Retrieval topics. I’m really interested in
            contributing to Apache Lucene and enhancing my
            understanding of the field.
            I’ve reviewed several issues posted on the Github
            repository but haven’t found a straightforward
            starting point. Could someone please recommend
            suitable issues for a newcomer like me or suggest
            areas I could assist with?

            Thank you for your time and guidance.

            Best regards,
            Hank Chang
            
---------------------------------------------------------------------
            To unsubscribe, e-mail:
            dev-unsubscr...@lucene.apache.org
            For additional commands, e-mail:
            dev-h...@lucene.apache.org



            

--Adrien
