Re: Experience re OpenAI embeddings in combination with Lucene vector search

Michael Wechner Tue, 15 Feb 2022 09:21:37 -0800

I understand, but if Lucene itself allowed overriding the default max size programmatically, then I think it should be clear that you do this at your own risk :-)


Thanks for the links to your blog posts, which sound very interesting.


Thanks

Michael

On 15.02.22 at 17:25, Alessandro Benedetti wrote:

I believe it could make sense, but as Michael pointed out in the Jira ticket related to the Solr integration, we would then get complaints like "I set it to 1.000.000 and my Solr instance doesn't work anymore" (I kept everything super simple just to simulate a realistic scenario). So I tend to agree with keeping it at 1024 for the moment and potentially extending it later (providing some benchmarks on common machines as a reference to justify the increase).
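
To make the memory side of this concrete, here is a rough back-of-the-envelope sketch. It assumes each vector component is stored as a 4-byte float32 and counts only the raw vectors, not the HNSW graph or any other index overhead. The corpus size of one million documents and the helper names (raw_vector_bytes, to_gib) are hypothetical; the dimension values are the ones mentioned in this thread.

```python
# Rough estimate of raw vector storage only. Assumes 4 bytes (float32)
# per vector component; graph and other index overhead are ignored.
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_docs: int, dims: int) -> int:
    """Bytes needed to hold num_docs vectors with dims components each."""
    return num_docs * dims * BYTES_PER_FLOAT32


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / (1024 ** 3)


if __name__ == "__main__":
    num_docs = 1_000_000  # hypothetical corpus size, not from the thread
    for dims in (768, 1024, 4096, 12288):  # sizes mentioned in the thread
        size = raw_vector_bytes(num_docs, dims)
        print(f"{dims:>6} dims: {size:>14,} bytes ({to_gib(size):.2f} GiB)")
```

Under these assumptions the footprint grows linearly with the dimension, so 12288-dimensional vectors need twelve times the raw storage of 1024-dimensional ones for the same number of documents.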

In terms of your original question, how are you training/fine-tuning your models? Using pre-trained language models probably won't help you that much; on top of that, queries are short, so you may require a well-curated fine-tuning step.

We have a series of blog posts on that, and one is coming soon:
https://sease.io/2021/12/using-bert-to-improve-search-relevance.html
https://sease.io/2022/01/tackling-vocabulary-mismatch-with-document-expansion.html

Cheers
--------------------------
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io <http://www.sease.io>

On Tue, 15 Feb 2022 at 09:10, Michael Wechner <[email protected]> wrote:


    fair enough, but wouldn't it make sense that one can increase it
    programmatically, e.g.

    .setVectorMaxDimension(2028)

    ?

    Thanks

    Michael


    On 14.02.22 at 23:34, Michael Sokolov wrote:
    > I think we picked the 1024 number as something that seemed so large
    > nobody would ever want to exceed it! Obviously that was naive. Still
    > the limit serves as a cautionary point for users; if your vectors are
    > bigger than this, there is probably a better way to accomplish what
    > you are after (e.g. better off-line training to reduce dimensionality).
    > Is 1024 the magic number? Maybe not, but before increasing I'd like to
    > see some strong evidence that bigger vectors than that are indeed
    > useful as part of a search application using Lucene.
    >
    > -Mike
    >
    > On Mon, Feb 14, 2022 at 5:08 PM Julie Tibshirani <[email protected]> wrote:
    >> Sounds good, hope the testing goes well! Memory and CPU (largely from more expensive vector distance calculations) are indeed the main factors to consider.
    >>
    >> Julie
    >>
    >> On Mon, Feb 14, 2022 at 1:02 PM Michael Wechner <[email protected]> wrote:
    >>> Hi Julie
    >>>
    >>> Thanks again for your feedback!
    >>>
    >>> I will do some more tests with "all-mpnet-base-v2" (768) and "all-roberta-large-v1" (1024), so 1024 is enough for me for the moment :-)
    >>>
    >>> But yes, I could imagine that eventually it might make sense to allow more dimensions than 1024.
    >>>
    >>> Besides memory and "CPU", are there other limiting factors re more dimensions?
    >>>
    >>> Thanks
    >>>
    >>> Michael
    >>>
    >>> On 14.02.22 at 21:53, Julie Tibshirani wrote:
    >>>
    >>> Hello Michael, the max number of dimensions is currently hardcoded and can't be changed. I could see an argument for increasing the default a bit and would be happy to discuss if you'd like to file a JIRA issue. However, 12288 dimensions still seems high to me; this is much larger than most well-established embedding models and could require a lot of memory.
    >>>
    >>> Julie
    >>>
    >>> On Mon, Feb 14, 2022 at 12:08 PM Michael Wechner <[email protected]> wrote:
    >>>> Hi Julie
    >>>>
    >>>> Thanks very much for this link, which is very interesting!
    >>>>
    >>>> Btw, do you have an idea how to increase the default max size of 1024?
    >>>>
    >>>> https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o
    >>>>
    >>>> Thanks
    >>>>
    >>>> Michael
    >>>>
    >>>>
    >>>>
    >>>> On 14.02.22 at 17:45, Julie Tibshirani wrote:
    >>>>
    >>>> Hello Michael, I don't have personal experience with these models, but I found this article insightful:
    >>>>
    >>>> https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9
    >>>>
    >>>> It evaluates the OpenAI models against a variety of existing models on tasks like sentence similarity and text retrieval. Although the other models are cheaper and have fewer dimensions, the OpenAI ones perform similarly or worse. This got me thinking that they might not be a good cost/effectiveness trade-off, especially the larger ones with 4096 or 12288 dimensions.
    >>>>
    >>>> Julie
    >>>>
    >>>> On Sun, Feb 13, 2022 at 1:55 AM Michael Wechner <[email protected]> wrote:
    >>>>> Re the OpenAI embeddings, the following recent paper might be of interest
    >>>>>
    >>>>> https://arxiv.org/pdf/2201.10005.pdf
    >>>>>
    >>>>> (Text and Code Embeddings by Contrastive Pre-Training, Jan 24, 2022)
    >>>>>
    >>>>> Thanks
    >>>>>
    >>>>> Michael
    >>>>>
    >>>>> On 13.02.22 at 00:14, Michael Wechner wrote:
    >>>>>
    >>>>> Here is a concrete example where I combine the OpenAI model "text-similarity-ada-001" with Lucene vector search
    >>>>>
    >>>>> INPUT sentence: "What is your age this year?"
    >>>>>
    >>>>> Result sentences
    >>>>>
    >>>>> 1) How old are you this year?
    >>>>>     score '0.98860765'
    >>>>>
    >>>>> 2) What was your age last year?
    >>>>>     score '0.97811764'
    >>>>>
    >>>>> 3) What is your age?
    >>>>>     score '0.97094905'
    >>>>>
    >>>>> 4) How old are you?
    >>>>>     score '0.9600177'
    >>>>>
    >>>>>
    >>>>> Result 1 is great; result 2 looks similar but is not correct from an "understanding" point of view; results 3 and 4 are good again.
    >>>>>
    >>>>> I understand "similarity" is not the same as "understanding", but I hope it makes it clearer what I am looking for :-)
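
For readers wondering where scores like the ones above can come from, here is a minimal sketch of a cosine-based score. The message does not say which similarity function was used, so this assumes cosine similarity mapped from [-1, 1] to [0, 1] via (1 + cos) / 2, which is, as far as I know, how Lucene's cosine similarity function is turned into a score. The two-dimensional vectors are toy values, not real embeddings, and the function names are made up for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def normalized_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Map cosine from [-1, 1] to [0, 1] via (1 + cos) / 2 (assumed scoring)."""
    return (1.0 + cosine(a, b)) / 2.0


if __name__ == "__main__":
    query = [1.0, 0.0]  # toy vectors, not real embeddings
    candidates = {"near": [0.9, 0.1], "orthogonal": [0.0, 1.0]}
    for name, vec in candidates.items():
        print(name, normalized_score(query, vec))
```

If that assumption holds, a score of 0.98860765 would correspond to a cosine of about 0.977. Either way, all four results sit very close together, so a high score for "What was your age last year?" mostly says the sentences share wording and topic, not that they mean the same thing.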
    >>>>>
    >>>>> Thanks
    >>>>>
    >>>>> Michael
    >>>>>
    >>>>>
    >>>>>
    >>>>> On 12.02.22 at 22:38, Michael Wechner wrote:
    >>>>>
    >>>>> Hi Alessandro
    >>>>>
    >>>>> I am mainly interested in detecting similarity, for example whether the following two sentences are similar, i.e. likely to mean the same thing
    >>>>>
    >>>>> "How old are you?"
    >>>>> "What is your age?"
    >>>>>
    >>>>> and that the following two sentences are not similar, i.e. do not mean the same thing
    >>>>>
    >>>>> "How old are you this year?"
    >>>>> "How old have you been last year?"
    >>>>>
    >>>>> But also performance, or how OpenAI embeddings compare, for example, with SBERT (https://sbert.net/docs/usage/semantic_textual_similarity.html)
    >>>>>
    >>>>> Thanks
    >>>>>
    >>>>> Michael
    >>>>>
    >>>>>
    >>>>>
    >>>>> On 12.02.22 at 20:41, Alessandro Benedetti wrote:
    >>>>>
    >>>>> Hi Michael, experience to what extent?
    >>>>> We have been exploring the area for a while, given we contributed the first neural search milestone to Apache Solr.
    >>>>> What is your curiosity? Performance? Relevance impact? How to integrate it?
    >>>>> Regards
    >>>>>
    >>>>> On Fri, 11 Feb 2022, 22:38 Michael Wechner, <[email protected]> wrote:
    >>>>>> Hi
    >>>>>>
    >>>>>> Does anyone have experience using OpenAI embeddings in combination with Lucene vector search?
    >>>>>>
    >>>>>> https://beta.openai.com/docs/guides/embeddings
    >>>>>>
    >>>>>> for example comparing performance re vector size
    >>>>>>
    >>>>>> https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings
    >>>>>>
    >>>>>> and
    >>>>>>
    >>>>>> https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings
    >>>>>>
    >>>>>> ?
    >>>>>>
    >>>>>>
    >>>>>> Thanks
    >>>>>>
    >>>>>> Michael
    >>>>>
    >>>>>
    >>>>>
    >
    ---------------------------------------------------------------------
    > To unsubscribe, e-mail: [email protected]
    > For additional commands, e-mail: [email protected]
    >


