Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-20 Thread Michael Wechner
Btw, I have now done some tests with the sentence-transformer models
"all-roberta-large-v1" and "all-mpnet-base-v2":


https://huggingface.co/sentence-transformers/all-roberta-large-v1
https://huggingface.co/sentence-transformers/all-mpnet-base-v2

see also https://www.sbert.net/docs/pretrained_models.html

With the following input/search question

"How old have you been last year?"

I receive the following cosine distances with "all-mpnet-base-v2" (768) 
for the previously indexed vectors (questions)


0.22234131087379294  How old are you this year?
0.2235891372002562   What was your age last year?
0.4337717812264763   How old are you?
0.4557796164007806   What is your age?

and with "all-roberta-large-v1" (1024)

0.25013378528376184  How old are you this year?
0.2715761666421139   What was your age last year?
0.4658360947506338   What is your age?
0.4859953687958164   How old are you?

So both models rank "How old are you this year?" marginally closer than "What was your age last year?", i.e. neither model really "understands" the question.

As Alessandro suggested, a "well-curated fine-tuning step" might improve 
this, but I have not been able to try that yet.
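For context on the numbers above: cosine distance is just 1 minus the cosine similarity of the two embedding vectors. A minimal stdlib-Python sketch (the vectors below are made-up toy values, not actual model outputs):

```python
import math

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (the real models emit 768/1024 dims)
query = [0.1, 0.9, 0.2]
indexed = {
    "How old are you this year?": [0.2, 0.8, 0.3],
    "What is your age?": [0.9, 0.1, 0.4],
}
for text in sorted(indexed, key=lambda t: cosine_distance(query, indexed[t])):
    print(f"{cosine_distance(query, indexed[text]):.4f}  {text}")
```

Identical directions give distance 0, orthogonal directions give 1, so the lists above are sorted with the "closest" question first.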


Thanks

Michael

On 14.02.22 at 22:02, Michael Wechner wrote:

Hi Julie

Thanks again for your feedback!

I will do some more tests with "all-mpnet-base-v2" (768) and 
"all-roberta-large-v1" (1024), so 1024 is enough for me for the moment :-)


But yes, I could imagine that eventually it might make sense to allow 
more than 1024 dimensions.


Besides memory and CPU, are there other limiting factors regarding more 
dimensions?
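On the memory side, a back-of-envelope estimate (my own rough numbers: raw float32 storage only, ignoring Lucene's HNSW graph overhead and JVM object costs):

```python
def index_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for float32 vectors; a real index adds graph overhead."""
    return num_vectors * dims * bytes_per_value

# Compare 768, 1024 and the davinci-sized 12288 dimensions
for dims in (768, 1024, 12288):
    gib = index_vector_bytes(1_000_000, dims) / 1024**3
    print(f"{dims:5d} dims, 1M vectors: {gib:6.2f} GiB")
```

At a million vectors this works out to roughly 2.9 GiB, 3.8 GiB and 45.8 GiB respectively, so 12288 dimensions needs about an order of magnitude more memory than 1024, which is presumably what Julie is cautioning about.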


Thanks

Michael

On 14.02.22 at 21:53, Julie Tibshirani wrote:
Hello Michael, the max number of dimensions is currently hardcoded 
and can't be changed. I could see an argument for increasing the 
default a bit and would be happy to discuss it if you'd like to file a 
JIRA issue. However, 12288 dimensions still seems high to me; this is 
much larger than most well-established embedding models use and could 
require a lot of memory.


Julie

On Mon, Feb 14, 2022 at 12:08 PM Michael Wechner 
 wrote:


Hi Julie

Thanks very much for this link, which is very interesting!

Btw, do you have an idea how to increase the default max size of
1024?

https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o

Thanks

Michael



On 14.02.22 at 17:45, Julie Tibshirani wrote:

Hello Michael, I don't have personal experience with these
models, but I found this article insightful:

https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9.
It evaluates the OpenAI models against a variety of existing
models on tasks like sentence similarity and text retrieval.
Although the other models are cheaper and have fewer dimensions,
the OpenAI ones perform similarly or worse. This got me thinking
that they might not be a good cost/effectiveness trade-off,
especially the larger ones with 4096 or 12288 dimensions.

Julie

On Sun, Feb 13, 2022 at 1:55 AM Michael Wechner
 wrote:

Re the OpenAI embedding the following recent paper might be
of interest

https://arxiv.org/pdf/2201.10005.pdf

(Text and Code Embeddings by Contrastive Pre-Training, Jan
24, 2022)

Thanks

Michael

On 13.02.22 at 00:14, Michael Wechner wrote:

Here is a concrete example where I combine the OpenAI model
"text-similarity-ada-001" with Lucene vector search

INPUT sentence: "What is your age this year?"

Result sentences

1) How old are you this year?
   score '0.98860765'

2) What was your age last year?
   score '0.97811764'

3) What is your age?
   score '0.97094905'

4) How old are you?
   score '0.9600177'


Result 1 is great. Result 2 looks similar but is not correct from an 
"understanding" point of view, while results 3 and 4 are good again.

I understand "similarity" is not the same as
"understanding", but I hope it makes it clearer what I am
looking for :-)
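The ranking in this example corresponds conceptually to exact nearest-neighbour search by cosine similarity, which Lucene's HNSW-based vector search approximates at scale. A small stdlib-Python sketch with hypothetical 2-d toy vectors (`top_k` is an illustrative helper, not a Lucene or OpenAI API):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, docs, k=4):
    """Exact brute-force ranking by cosine similarity, highest first.
    An HNSW graph search returns (approximately) the same ranking."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(reverse=True)
    return scored[:k]

# Hypothetical 2-d stand-ins for the indexed question embeddings
docs = [
    ("How old are you this year?", [0.95, 0.10]),
    ("What was your age last year?", [0.90, 0.25]),
    ("What is your age?", [0.70, 0.55]),
    ("How old are you?", [0.55, 0.65]),
]
for score, text in top_k([1.0, 0.12], docs):
    print(f"{score:.4f}  {text}")
```

As in the email above, sentences whose vectors point in nearly the same direction score close to 1.0 regardless of whether they actually mean the same thing.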

Thanks

Michael



On 12.02.22 at 22:38, Michael Wechner wrote:

Hi Alessandro

I am mainly interested in detecting similarity, for example whether 
the following two sentences are similar, i.e. likely to mean the same 
thing

"How old are you?"
"What is your age?"

and that the following two sentences are not similar, i.e. do not mean 
the same thing

"How old are you this year?"
"How old have you been last year?"

But also performance or how OpenAI embeddings compare for
example with SBERT
(https://sbert.net/docs/usage/semantic_textual_similarity.html)

Thanks

Michael




Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Robert Muir
On Tue, Feb 15, 2022 at 2:33 PM Michael Wechner 
wrote:

>
> There seems to be no light at the end of the tunnel for the JDK vector
> api, I think OpenJDK will incubate this API until the sun supernovas and
> java is dead :)
> It is frustrating, as that could give current implementation a needed
> performance boost on basically any hardware.
>
>
> I guess you mean https://openjdk.java.net/jeps/338 right?
>
>
>
Yes, but also these:

https://openjdk.java.net/jeps/414
https://openjdk.java.net/jeps/417
https://openjdk.java.net/jeps/8280173


Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Michael Wechner



On 15.02.22 at 19:48, Robert Muir wrote:
Sure, but lucene should be able to have limits. We have this 
discussion with every single limit we attempt to implement :)
There will always be extreme use cases using too many dimensions or 
whatever.
It is open source! I think if what you are doing is strange enough, 
you can modify the sources.


sure :-)



Personally, I'm concerned about increasing this limit: things are 
quite slow already with hundreds of dimensions.


In my particular use case performance is not the most important factor; 
rather, it is the quality of the results.


But as Julie pointed out with 
https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9 
more dimensions do not necessarily produce better results; at least 
this seems to be the case for sentence embeddings.


I could imagine, though, that there might be other use cases where more 
dimensions do make a difference, but then again we can of course wait 
until this actually happens.


There seems to be no light at the end of the tunnel for the JDK vector 
api, I think OpenJDK will incubate this API until the sun supernovas 
and java is dead :)
It is frustrating, as that could give current implementation a needed 
performance boost on basically any hardware.


I guess you mean https://openjdk.java.net/jeps/338 right?




Also, I'm concerned about increasing limit while HNSW is the only 
implementation. I'd like us to keep the door open to alternative 
algorithms that might have better performance.


It would be great if Lucene provided alternative algorithms in the 
future, so that one could choose an algorithm based on one's requirements.


Thanks

Michael





Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Robert Muir
Sure, but lucene should be able to have limits. We have this discussion
with every single limit we attempt to implement :)
There will always be extreme use cases using too many dimensions or
whatever.
It is open source! I think if what you are doing is strange enough, you can
modify the sources.

Personally, I'm concerned about increasing this limit: things are quite
slow already with hundreds of dimensions.
There seems to be no light at the end of the tunnel for the JDK vector api,
I think OpenJDK will incubate this API until the sun supernovas and java is
dead :)
It is frustrating, as that could give current implementation a needed
performance boost on basically any hardware.

Also, I'm concerned about increasing limit while HNSW is the only
implementation. I'd like us to keep the door open to alternative algorithms
that might have better performance.


Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Michael Wechner
I understand, but if Lucene itself allowed overriding the default 
max size programmatically, then I think it would be clear that you do 
this at your own risk :-)


Thanks for the links to your blog posts, which sound very interesting.

Thanks

Michael


Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Alessandro Benedetti
I believe it could make sense, but as Michael pointed out in the Jira
ticket related to the Solr integration, then we'll get complaints like "I
set it to 1.000.000 and my Solr instance doesn't work anymore" (I kept
everything super simple just to simulate a realistic scenario).
So I tend to agree to keep it at 1024 for the moment and potentially extend
it later (providing some benchmarks on common machines as a reference to
justify the increase).

In terms of your original question, how are you training/fine-tuning your
models?
Using pre-trained language models probably won't help you that much; on top
of that, queries are short, so you may require a well-curated fine-tuning
step.
We have a series of blog posts on that, and one is coming soon:
https://sease.io/2021/12/using-bert-to-improve-search-relevance.html
https://sease.io/2022/01/tackling-vocabulary-mismatch-with-document-expansion.html

Cheers
--
Alessandro Benedetti
Apache Lucene/Solr PMC member and Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io



Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Michael Wechner
fair enough, but wouldn't it make sense to allow increasing it 
programmatically, e.g.

.setVectorMaxDimension(2028)

?

Thanks

Michael




On 12.02.22 at 20:41, Alessandro Benedetti wrote:

Hi Michael, experience to what extent?
We have been exploring the area for a while given we contributed the first 
neural search milestone to Apache Solr.
What is your curiosity? Performance? Relevance impact? How to integrate it?
Regards

On Fri, 11 Feb 2022, 22:38 Michael Wechner,  wrote:

Hi

Does anyone have experience using OpenAI embeddings in combination with Lucene 
vector search?

https://beta.openai.com/docs/guides/embeddings

for example comparing performance re vector size

https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings

and

https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings

?


Thanks

Michael



Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Michael Sokolov
I think we picked the 1024 number as something that seemed so large
nobody would ever want to exceed it! Obviously that was naive. Still
the limit serves as a cautionary point for users; if your vectors are
bigger than this, there is probably a better way to accomplish what
you are after (eg better off-line training to reduce dimensionality).
Is 1024 the magic number? Maybe not, but before increasing I'd like to
see some strong evidence that bigger vectors than that are indeed
useful as part of a search application using Lucene.

-Mike
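One cheap, training-free illustration of the "reduce dimensionality off-line" point above is a random projection (the Johnson-Lindenstrauss idea); a learned reduction, which is presumably what Mike means, would normally preserve semantics better, so this is only a sketch of the concept:

```python
import math
import random

def random_projection(vec, out_dims, seed=0):
    """Project vec down to out_dims using a seeded Gaussian random matrix.
    Pairwise distances are approximately preserved (Johnson-Lindenstrauss);
    a trained reduction would typically preserve semantics better."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dims)
    return [scale * sum(rng.gauss(0.0, 1.0) * x for x in vec)
            for _ in range(out_dims)]

# Shrink a hypothetical 2048-dim embedding to 256 dims
# (the same idea applies to e.g. 12288 -> 1024)
rng = random.Random(42)
big = [rng.uniform(-1.0, 1.0) for _ in range(2048)]
small = random_projection(big, 256)
print(len(big), "->", len(small))  # prints: 2048 -> 256
```

The projection is deterministic for a fixed seed, which matters in practice: indexed vectors and query vectors must be reduced with the same matrix.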

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Julie Tibshirani
Sounds good, hope the testing goes well! Memory and CPU (largely from more
expensive vector distance calculations) are indeed the main factors to
consider.

Julie
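As a rough illustration of the CPU factor: one cosine-distance calculation touches every dimension once, so its cost grows roughly linearly with the vector size. A minimal pure-Python sketch (my own function, not Lucene's actual implementation):

```python
import math

def cosine_distance(a, b):
    # One multiply-add per dimension for the dot product and each norm,
    # so a 12288-dim comparison costs roughly 12x a 1024-dim one.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing in the same direction have distance ~0,
# orthogonal vectors have distance 1.
print(cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

During HNSW search many such comparisons are performed per query, so this per-dimension cost multiplies out.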



Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Michael Wechner

Hi Julie

Thanks again for your feedback!

I will do some more tests with "all-mpnet-base-v2" (768) and 
"all-roberta-large-v1" (1024), so 1024 is enough for me for the moment :-)

But yes, I could imagine that eventually it might make sense to allow 
more than 1024 dimensions.

Besides memory and CPU, are there other limiting factors for more 
dimensions?


Thanks

Michael











Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Julie Tibshirani
Hello Michael, the max number of dimensions is currently hardcoded and
can't be changed. I could see an argument for increasing the default a bit
and would be happy to discuss if you'd like to file a JIRA issue.
However, 12288 dimensions still seems high to me; this is much larger than
most well-established embedding models and could require a lot of memory.

Julie
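To put the memory concern in numbers: the raw float32 vectors alone take num_vectors * dims * 4 bytes, before any HNSW graph overhead. A back-of-the-envelope sketch (my own figures, assuming one million indexed vectors):

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_float=4):
    # float32 storage only; the HNSW graph structures add further overhead.
    return num_vectors * dims * bytes_per_float

for dims in (768, 1024, 4096, 12288):
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>5} dims, 1M vectors: {gib:.1f} GiB")
```

At 12288 dimensions that is already about 46 GiB for a million vectors, versus roughly 3.8 GiB at 1024.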



Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Michael Wechner

Hi Julie

Thanks very much for this link, which is very interesting!

Btw, do you have an idea how to increase the default max size of 1024?

https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o

Thanks

Michael











Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Julie Tibshirani
Hello Michael, I don't have personal experience with these models, but I
found this article insightful:
https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9.
It evaluates the OpenAI models against a variety of existing models on
tasks like sentence similarity and text retrieval. Although the other
models are cheaper and have fewer dimensions, the OpenAI ones perform
similarly or worse. This got me thinking that they might not be a good
cost/effectiveness trade-off, especially the larger ones with 4096
or 12288 dimensions.

Julie



Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-13 Thread Michael Wechner

Re the OpenAI embedding the following recent paper might be of interest

https://arxiv.org/pdf/2201.10005.pdf

(Text and Code Embeddings by Contrastive Pre-Training, Jan 24, 2022)

Thanks

Michael







Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-12 Thread Michael Wechner
Here is a concrete example where I combine the OpenAI model 
"text-similarity-ada-001" with Lucene vector search


INPUT sentence: "What is your age this year?"

Result sentences

1) How old are you this year?
   score '0.98860765'

2) What was your age last year?
   score '0.97811764'

3) What is your age?
   score '0.97094905'

4) How old are you?
   score '0.9600177'


Result 1 is great, and result 2 looks similar but is not correct from an 
"understanding" point of view; results 3 and 4 are good again.


I understand "similarity" is not the same as "understanding", but I hope 
it makes it clearer what I am looking for :-)


Thanks

Michael
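The scores here appear to be cosine similarities, so ranking by descending score is equivalent to ranking by ascending cosine distance (1 - score). A toy sketch with made-up three-dimensional vectors standing in for the real ada-001 embeddings:

```python
import math

def cosine_similarity(a, b):
    # score = cos(angle between a and b); cosine distance = 1 - score
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Made-up vectors, NOT real embeddings, chosen so the first candidate
# points closer to the query direction.
query = [0.9, 0.1, 0.2]                         # "What is your age this year?"
candidates = {
    "How old are you this year?":   [0.88, 0.12, 0.22],
    "What was your age last year?": [0.80, 0.30, 0.15],
}

# Sort by descending similarity, i.e. ascending distance.
ranked = sorted(candidates,
                key=lambda s: cosine_similarity(query, candidates[s]),
                reverse=True)
print(ranked)
```

With real model output, whether "What was your age last year?" lands above or below the correct paraphrase depends entirely on how the embedding model separates the two meanings.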







Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-12 Thread Michael Wechner

Hi Alessandro

I am mainly interested in detecting similarity, for example whether the 
following two sentences are similar, i.e. likely to mean the same thing


"How old are you?"
"What is your age?"

and that the following two sentences are not similar, i.e. do not mean 
the same thing


"How old are you this year?"
"How old have you been last year?"

But I am also interested in performance, and in how OpenAI embeddings 
compare with, for example, SBERT 
(https://sbert.net/docs/usage/semantic_textual_similarity.html)


Thanks

Michael





Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-12 Thread Alessandro Benedetti
Hi Michael, experience to what extent?
We have been exploring the area for a while given we contributed the first
neural search milestone to Apache Solr.
What is your curiosity? Performance? Relevance impact? How to integrate it?
Regards

On Fri, 11 Feb 2022, 22:38 Michael Wechner, 
wrote:

> Hi
>
> Does anyone have experience using OpenAI embeddings in combination with
> Lucene vector search?
>
> https://beta.openai.com/docs/guides/embeddings
>
> for example comparing performance re vector size
>
> https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings
>
> and
>
> https://api.openai.com/v1/engines/text-similarity-davinci-001/embeddings
>
> ?
>
>
> Thanks
>
> Michael
>