Re: Experience re OpenAI embeddings in combination with Lucene vector search

Michael Wechner Tue, 15 Feb 2022 11:32:59 -0800


On 15.02.22 at 19:48, Robert Muir wrote:

Sure, but Lucene should be able to have limits. We have this discussion with every single limit we attempt to implement :) There will always be extreme use cases using too many dimensions or whatever. It is open source! I think if what you are doing is strange enough, you can modify the sources.


sure :-)

Personally, I'm concerned about increasing this limit: things are quite slow already with hundreds of dimensions.

In my particular use case, performance is not the most important factor, but rather the quality of the results.
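
To put the performance concern in rough numbers, here is a minimal sketch that estimates raw storage and per-comparison work for the vector sizes mentioned in this thread (768, 1024, 4096 and 12288). The corpus size of one million vectors, the 4-byte float representation and the function names are assumptions for illustration only; index overhead such as the HNSW graph is ignored.

```python
# Back-of-the-envelope cost of storing and comparing float32 vectors.
# Assumptions (not from the thread): 4 bytes per component, one million
# vectors, no graph or index overhead, one multiply-add per component.

BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors: int, dims: int) -> int:
    """Bytes needed for the raw vector data alone."""
    return num_vectors * dims * BYTES_PER_FLOAT32


def multiply_adds_per_comparison(dims: int) -> int:
    """A dot product touches every component exactly once."""
    return dims


if __name__ == "__main__":
    num_vectors = 1_000_000  # hypothetical corpus size
    for dims in (768, 1024, 4096, 12288):
        gib = raw_vector_bytes(num_vectors, dims) / 2**30
        print(f"{dims:>6} dims: {gib:6.2f} GiB raw, "
              f"{multiply_adds_per_comparison(dims)} multiply-adds per comparison")
```

Both quantities grow linearly with the number of dimensions, so under these assumptions a 12288-dimensional vector costs twelve times as much to store and compare as a 1024-dimensional one.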

But as Julie pointed out with https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9 more dimensions do not necessarily create better results; at least this seems to be the case for sentence embeddings.

I could imagine, though, that there might be other use cases where more dimensions do make a difference, but then again we can of course wait until this actually happens.

There seems to be no light at the end of the tunnel for the JDK vector API, I think OpenJDK will incubate this API until the sun supernovas and java is dead :) It is frustrating, as that could give current implementation a needed performance boost on basically any hardware.


I guess you mean https://openjdk.java.net/jeps/338 right?

Also, I'm concerned about increasing limit while HNSW is the only implementation. I'd like us to keep the door open to alternative algorithms that might have better performance.

It would be great if Lucene provided alternative algorithms in the future, so that one could choose the algorithm based on one's requirements.


Thanks

Michael

On Tue, Feb 15, 2022 at 12:21 PM Michael Wechner <michael.wech...@wyona.com> wrote:


    I understand, but if Lucene itself allowed overriding the
    default max size programmatically, then I think it should be clear
    that you do this at your own risk :-)
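
The idea of a conservative default with an explicit, opt-in override could look roughly like the sketch below. This is purely hypothetical: `VectorFieldConfig`, `set_vector_max_dimension` and `validate` are names invented for illustration and are not part of Lucene's API.

```python
# Hypothetical sketch only: VectorFieldConfig and its methods are NOT part
# of Lucene. They illustrate a default limit with an explicit opt-in override.
from typing import Sequence

DEFAULT_MAX_DIMENSIONS = 1024  # the limit discussed in this thread


class VectorFieldConfig:
    def __init__(self) -> None:
        self.max_dimensions = DEFAULT_MAX_DIMENSIONS

    def set_vector_max_dimension(self, max_dimensions: int) -> "VectorFieldConfig":
        """Change the limit; the caller accepts the memory and CPU cost."""
        if max_dimensions <= 0:
            raise ValueError("max_dimensions must be positive")
        self.max_dimensions = max_dimensions
        return self

    def validate(self, vector: Sequence[float]) -> None:
        """Reject vectors longer than the configured limit."""
        if len(vector) > self.max_dimensions:
            raise ValueError(
                f"vector has {len(vector)} dimensions, "
                f"limit is {self.max_dimensions}"
            )
```

With this shape, users who never call the setter keep the safe default, and anyone who raises the limit has done so deliberately.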

    Thanks for the links to your blog posts, which sound very interesting.

    Thanks

    Michael

    On 15.02.22 at 17:25, Alessandro Benedetti wrote:

    I believe it could make sense, but as Michael pointed out in the
    Jira ticket related to the Solr integration, then we'll get
    complaints like "I set it to 1.000.000 and my Solr instance
    doesn't work anymore" (I kept everything super simple just to
    simulate a realistic scenario).
    So I tend to agree to keep it to 1024 at the moment and
    potentially extend it (providing some benchmark on common machines
    as a reference to justify the increase).

    In terms of your original question, how are you
    training/fine-tuning your models?
    Using pre-trained language models probably won't help you that
    much; on top of that, queries are short, so you may require a
    well-curated fine-tuning step.
    We have a series of blog posts on that, and one is coming soon:
    https://sease.io/2021/12/using-bert-to-improve-search-relevance.html
    https://sease.io/2022/01/tackling-vocabulary-mismatch-with-document-expansion.html

    Cheers
    --------------------------
    Alessandro Benedetti
    Apache Lucene/Solr PMC member and Committer
    Director, R&D Software Engineer, Search Consultant

    www.sease.io <http://www.sease.io>


    On Tue, 15 Feb 2022 at 09:10, Michael Wechner
    <michael.wech...@wyona.com> wrote:

        fair enough, but wouldn't it make sense that one can increase it
        programmatically, e.g.

        .setVectorMaxDimension(2028)

        ?

        Thanks

        Michael


        On 14.02.22 at 23:34, Michael Sokolov wrote:
        > I think we picked the 1024 number as something that seemed so large
        > nobody would ever want to exceed it! Obviously that was naive. Still
        > the limit serves as a cautionary point for users; if your vectors are
        > bigger than this, there is probably a better way to accomplish what
        > you are after (e.g. better off-line training to reduce dimensionality).
        > Is 1024 the magic number? Maybe not, but before increasing I'd like to
        > see some strong evidence that bigger vectors than that are indeed
        > useful as part of a search application using Lucene.
        >
        > -Mike
        >
        > On Mon, Feb 14, 2022 at 5:08 PM Julie Tibshirani <juliet...@gmail.com> wrote:
        >> Sounds good, hope the testing goes well! Memory and CPU (largely from more expensive vector distance calculations) are indeed the main factors to consider.
        >>
        >> Julie
        >>
        >> On Mon, Feb 14, 2022 at 1:02 PM Michael Wechner <michael.wech...@wyona.com> wrote:
        >>> Hi Julie
        >>>
        >>> Thanks again for your feedback!
        >>>
        >>> I will do some more tests with "all-mpnet-base-v2" (768) and "all-roberta-large-v1" (1024), so 1024 is enough for me for the moment :-)
        >>>
        >>> But yes, I could imagine that eventually it might make sense to allow more dimensions than 1024.
        >>>
        >>> Besides memory and "CPU", are there other limiting factors regarding more dimensions?
        >>>
        >>> Thanks
        >>>
        >>> Michael
        >>>
        >>> On 14.02.22 at 21:53, Julie Tibshirani wrote:
        >>>
        >>> Hello Michael, the max number of dimensions is currently hardcoded and can't be changed. I could see an argument for increasing the default a bit and would be happy to discuss if you'd like to file a JIRA issue. However, 12288 dimensions still seems high to me; this is much larger than most well-established embedding models and could require a lot of memory.
        >>>
        >>> Julie
        >>>
        >>> On Mon, Feb 14, 2022 at 12:08 PM Michael Wechner <michael.wech...@wyona.com> wrote:
        >>>> Hi Julie
        >>>>
        >>>> Thanks very much for this link, which is very interesting!
        >>>>
        >>>> Btw, do you have an idea how to increase the default max size of 1024?
        >>>>
        >>>> https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o
        >>>>
        >>>> Thanks
        >>>>
        >>>> Michael
        >>>>
        >>>>
        >>>>
        >>>> On 14.02.22 at 17:45, Julie Tibshirani wrote:
        >>>>
        >>>> Hello Michael, I don't have personal experience with these models, but I found this article insightful: https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9. It evaluates the OpenAI models against a variety of existing models on tasks like sentence similarity and text retrieval. Although the other models are cheaper and have fewer dimensions, the OpenAI ones perform similarly or worse. This got me thinking that they might not be a good cost/effectiveness trade-off, especially the larger ones with 4096 or 12288 dimensions.
        >>>>
        >>>> Julie
        >>>>
        >>>> On Sun, Feb 13, 2022 at 1:55 AM Michael Wechner <michael.wech...@wyona.com> wrote:
        >>>>> Regarding the OpenAI embeddings, the following recent paper might be of interest:
        >>>>>
        >>>>> https://arxiv.org/pdf/2201.10005.pdf
        >>>>>
        >>>>> (Text and Code Embeddings by Contrastive Pre-Training, Jan 24, 2022)
        >>>>>
        >>>>> Thanks
        >>>>>
        >>>>> Michael
        >>>>>
        >>>>> On 13.02.22 at 00:14, Michael Wechner wrote:
        >>>>>
        >>>>> Here is a concrete example where I combine the OpenAI model "text-similarity-ada-001" with Lucene vector search
        >>>>>
        >>>>> INPUT sentence: "What is your age this year?"
        >>>>>
        >>>>> Result sentences
        >>>>>
        >>>>> 1) How old are you this year?
        >>>>>     score '0.98860765'
        >>>>>
        >>>>> 2) What was your age last year?
        >>>>>     score '0.97811764'
        >>>>>
        >>>>> 3) What is your age?
        >>>>>     score '0.97094905'
        >>>>>
        >>>>> 4) How old are you?
        >>>>>     score '0.9600177'
        >>>>>
        >>>>>
        >>>>> Result 1 is great. Result 2 looks similar but is not correct from an "understanding" point of view, and results 3 and 4 are good again.
        >>>>>
        >>>>> I understand "similarity" is not the same as "understanding", but I hope it makes it clearer what I am looking for :-)
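
For readers unfamiliar with how such similarity scores come about, here is a minimal sketch of ranking candidates by cosine similarity. The thread does not say which similarity function produced the scores above, so cosine is an assumption, and the 3-dimensional vectors are toy values invented for illustration rather than real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy vectors, invented for illustration; real embeddings from a model
    # would have hundreds or thousands of components.
    query = [1.0, 0.0, 1.0]
    candidates = {
        "close": [0.9, 0.1, 1.0],
        "farther": [1.0, 1.0, 0.0],
    }
    for label, score in rank_by_similarity(query, candidates):
        print(f"{label}: {score:.4f}")
```

A score close to 1 only says that two vectors point in nearly the same direction; whether that corresponds to the same meaning depends entirely on the embedding model.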
        >>>>>
        >>>>> Thanks
        >>>>>
        >>>>> Michael
        >>>>>
        >>>>>
        >>>>>
        >>>>> On 12.02.22 at 22:38, Michael Wechner wrote:
        >>>>>
        >>>>> Hi Alessandro
        >>>>>
        >>>>> I am mainly interested in detecting similarity, for example whether the following two sentences are similar, i.e. likely to mean the same thing
        >>>>>
        >>>>> "How old are you?"
        >>>>> "What is your age?"
        >>>>>
        >>>>> and that the following two sentences are not similar, i.e. do not mean the same thing
        >>>>>
        >>>>> "How old are you this year?"
        >>>>> "How old have you been last year?"
        >>>>>
        >>>>> But I am also interested in performance, or in how OpenAI embeddings compare, for example, with SBERT (https://sbert.net/docs/usage/semantic_textual_similarity.html)
        >>>>>
        >>>>> Thanks
        >>>>>
        >>>>> Michael
        >>>>>
        >>>>>
        >>>>>
        >>>>> On 12.02.22 at 20:41, Alessandro Benedetti wrote:
        >>>>>
        >>>>> Hi Michael, experience to what extent?
        >>>>> We have been exploring the area for a while given we contributed the first neural search milestone to Apache Solr.
        >>>>> What are you curious about? Performance? Relevance impact? How to integrate it?
        >>>>> Regards
        >>>>>
        >>>>> On Fri, 11 Feb 2022, 22:38 Michael Wechner, <michael.wech...@wyona.com> wrote:
        >>>>>> Hi
        >>>>>>
        >>>>>> Does anyone have experience using OpenAI embeddings in combination with Lucene vector search?
        >>>>>>
        >>>>>> https://beta.openai.com/docs/guides/embeddings
        >>>>>>
        >>>>>> for example comparing performance with respect to vector size
        >>>>>>
        >>>>>> https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings
        >>>>>>
        >>>>>> and
        >>>>>>
        >>>>>> https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings
        >>>>>>
        >>>>>> ?
        >>>>>>
        >>>>>>
        >>>>>> Thanks
        >>>>>>
        >>>>>> Michael
        >>>>>
        >>>>>
        >>>>>
        > ---------------------------------------------------------------------
        > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
        > For additional commands, e-mail: dev-h...@lucene.apache.org
        >


