Re: Experience re OpenAI embeddings in combination with Lucene vector search

Michael Wechner Sun, 20 Feb 2022 13:54:57 -0800

Btw, I have now done some tests with the sentence-transformers models "all-roberta-large-v1" and "all-mpnet-base-v2"


https://huggingface.co/sentence-transformers/all-roberta-large-v1
https://huggingface.co/sentence-transformers/all-mpnet-base-v2


See also https://www.sbert.net/docs/pretrained_models.html

With the following input/search question

"How old have you been last year?"

I receive the following cosine distances with "all-mpnet-base-v2" (768 dimensions) for the previously indexed vectors (questions)


0.22234131087379294  How old are you this year?
0.2235891372002562   What was your age last year?
0.4337717812264763   How old are you?
0.4557796164007806   What is your age?

and with "all-roberta-large-v1" (1024 dimensions)

0.25013378528376184  How old are you this year?
0.2715761666421139   What was your age last year?
0.4658360947506338   What is your age?
0.4859953687958164   How old are you?

So neither model "understands" the question: both rank "How old are you this year?" closest, although the question asks about last year.
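
For reference, a minimal sketch of how cosine distances like the ones above can be computed and ranked. It assumes the distance is defined as 1 minus the cosine similarity, and it uses made-up 3-dimensional toy vectors rather than real model output; the names `cosine_distance`, `rank_by_distance` and `toy_index` are only illustrative. With a real model the vectors would come from the encoder instead.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, indexed):
    """Return (distance, text) pairs sorted by ascending cosine distance."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)


# Toy vectors, invented for illustration only (not embeddings from any model).
toy_index = [
    ("doc A", [1.0, 0.0, 0.0]),
    ("doc B", [1.0, 1.0, 0.0]),
    ("doc C", [0.0, 0.0, 1.0]),
]

ranking = rank_by_distance([1.0, 0.0, 0.0], toy_index)
for distance, text in ranking:
    print(f"{distance:.4f}  {text}")
```

A smaller distance means the two vectors point in a more similar direction. It says nothing about whether the sentences agree on details such as "this year" versus "last year".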

As Alessandro suggested, a "well-curated fine-tuning step" might improve this, but I have not been able to try it yet.


Thanks

Michael

On 14.02.22 at 22:02, Michael Wechner wrote:

Hi Julie

Thanks again for your feedback!

I will do some more tests with "all-mpnet-base-v2" (768) and "all-roberta-large-v1" (1024), so 1024 is enough for me for the moment :-)

But yes, I could imagine that eventually it might make sense to allow more dimensions than 1024.

Besides memory and "CPU", are there other limiting factors regarding more dimensions?


Thanks

Michael

On 14.02.22 at 21:53, Julie Tibshirani wrote:

Hello Michael, the max number of dimensions is currently hardcoded and can't be changed. I could see an argument for increasing the default a bit and would be happy to discuss if you'd like to file a JIRA issue. However, 12288 dimensions still seems high to me; this is much larger than most well-established embedding models and could require a lot of memory.
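
To put the memory concern in rough numbers, here is a back-of-the-envelope sketch. It assumes each dimension is stored as a 4-byte float32, one vector per document, and a hypothetical corpus of one million documents; it counts only the raw vector data and ignores any index or graph overhead. The names `raw_vector_bytes` and `NUM_DOCS` are illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: one 32-bit float per dimension


def raw_vector_bytes(num_docs, dims):
    """Bytes needed for the raw vectors alone (no index overhead)."""
    return num_docs * dims * BYTES_PER_FLOAT32


NUM_DOCS = 1_000_000  # hypothetical corpus size, for illustration only

# Dimension counts mentioned in this thread.
for dims in (768, 1024, 4096, 12288):
    gib = raw_vector_bytes(NUM_DOCS, dims) / 2**30
    print(f"{dims:>5} dims: {gib:6.2f} GiB")
```

Under these assumptions a single 12288-dimensional vector takes 48 KiB, twelve times as much as a 1024-dimensional one.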


Julie

On Mon, Feb 14, 2022 at 12:08 PM Michael Wechner <michael.wech...@wyona.com> wrote:


    Hi Julie

    Thanks very much for this link, which is very interesting!

    Btw, do you have an idea how to increase the default max size of
    1024?

    https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o

    Thanks

    Michael



    On 14.02.22 at 17:45, Julie Tibshirani wrote:

    Hello Michael, I don't have personal experience with these
    models, but I found this article insightful:
    https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9
    It evaluates the OpenAI models against a variety of existing
    models on tasks like sentence similarity and text retrieval.
    Although the other models are cheaper and have fewer dimensions,
    the OpenAI ones perform similarly or worse. This got me thinking
    that they might not be a good cost/effectiveness trade-off,
    especially the larger ones with 4096 or 12288 dimensions.

    Julie

    On Sun, Feb 13, 2022 at 1:55 AM Michael Wechner
    <michael.wech...@wyona.com> wrote:

        Regarding the OpenAI embeddings, the following recent paper
        might be of interest

        https://arxiv.org/pdf/2201.10005.pdf

        (Text and Code Embeddings by Contrastive Pre-Training, Jan
        24, 2022)

        Thanks

        Michael

        On 13.02.22 at 00:14, Michael Wechner wrote:

        Here is a concrete example where I combine the OpenAI model
        "text-similarity-ada-001" with Lucene vector search

        INPUT sentence: "What is your age this year?"

        Result sentences

        1) How old are you this year?
           score '0.98860765'

        2) What was your age last year?
           score '0.97811764'

        3) What is your age?
           score '0.97094905'

        4) How old are you?
           score '0.9600177'


        Result 1 is great and result 2 looks similar, but is not
        correct from an "understanding" point of view and results 3
        and 4 are good again.

        I understand "similarity" is not the same as
        "understanding", but I hope it makes it clearer what I am
        looking for :-)

        Thanks

        Michael



        On 12.02.22 at 22:38, Michael Wechner wrote:

        Hi Alessandro

        I am mainly interested in detecting similarity, for
        example whether the following two sentences are similar,
        i.e. likely to mean the same thing

        "How old are you?"
        "What is your age?"

        and that the following two sentences are not similar,
        i.e. do not mean the same thing

        "How old are you this year?"
        "How old have you been last year?"

        I am also interested in performance, and in how OpenAI
        embeddings compare with, for example, SBERT
        (https://sbert.net/docs/usage/semantic_textual_similarity.html)

        Thanks

        Michael



        On 12.02.22 at 20:41, Alessandro Benedetti wrote:

        Hi Michael, experience to what extent?
        We have been exploring the area for a while given we
        contributed the first neural search milestone to Apache Solr.
        What is your curiosity? Performance? Relevance impact?
        How to integrate it?
        Regards

        On Fri, 11 Feb 2022, 22:38 Michael Wechner,
        <michael.wech...@wyona.com> wrote:

            Hi

            Does anyone have experience using OpenAI embeddings
            in combination with Lucene vector search?

            https://beta.openai.com/docs/guides/embeddings

            for example comparing performance re vector size

            https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings

            and

            https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings

            ?

            Thanks

            Michael
