Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-20 Thread Michael Wechner
btw, I have now done some tests with the sentence-transformer models "all-roberta-large-v1" and "all-mpnet-base-v2" (https://huggingface.co/sentence-transformers/all-roberta-large-v1, https://huggingface.co/sentence-transformers/all-mpnet-base-v2); see also
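
A minimal sketch of how such sentence-transformer vectors (768 dimensions for all-mpnet-base-v2, 1024 for all-roberta-large-v1) could be indexed with Lucene 9.x. The embed() helper is a hypothetical placeholder for whatever model server actually produces the embeddings; sentence-transformer vectors are unit-normalized, so DOT_PRODUCT behaves like cosine similarity here.

    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnVectorField;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class IndexSentenceEmbeddings {

        public static void main(String[] args) throws Exception {
            try (FSDirectory dir = FSDirectory.open(Paths.get("vector-index"));
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
                String sentence = "How old are you?";
                // embed() stands in for a call to all-mpnet-base-v2 (768 dims)
                float[] vector = embed(sentence);
                Document doc = new Document();
                doc.add(new StoredField("text", sentence));
                doc.add(new KnnVectorField("vector", vector, VectorSimilarityFunction.DOT_PRODUCT));
                writer.addDocument(doc);
            }
        }

        // Hypothetical helper: in practice this would call a model server or a Python process.
        static float[] embed(String text) {
            float[] v = new float[768];
            v[0] = 1.0f; // unit-length placeholder vector
            return v;
        }
    }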

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Robert Muir
On Tue, Feb 15, 2022 at 2:33 PM Michael Wechner wrote: > > There seems to be no light at the end of the tunnel for the JDK vector > api, I think OpenJDK will incubate this API until the sun supernovas and > java is dead :) > It is frustrating, as that could give current implementation a needed >

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Michael Wechner
On 15.02.22 at 19:48, Robert Muir wrote: Sure, but Lucene should be able to have limits. We have this discussion with every single limit we attempt to implement :) There will always be extreme use cases using too many dimensions or whatever. It is open source! I think if what you are doing

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Robert Muir
Sure, but Lucene should be able to have limits. We have this discussion with every single limit we attempt to implement :) There will always be extreme use cases using too many dimensions or whatever. It is open source! I think if what you are doing is strange enough, you can modify the sources.

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Michael Wechner
I understand, but if Lucene itself allowed overriding the default max size programmatically, then I think it should be clear that you do this at your own risk :-) Thanks for the links to your blog posts, which sound very interesting. Thanks Michael On 15.02.22 at 17:25,

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Alessandro Benedetti
I believe it could make sense, but as Michael pointed out in the Jira ticket related to the Solr integration, we'll get complaints like "I set it to 1.000.000 and my Solr instance doesn't work anymore" (I kept everything super simple just to simulate a realistic scenario). So I tend to agree

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-15 Thread Michael Wechner
fair enough, but wouldn't it make sense to allow increasing it programmatically, e.g. .setVectorMaxDimension(2028)? Thanks Michael On 14.02.22 at 23:34, Michael Sokolov wrote: I think we picked the 1024 number as something that seemed so large nobody would ever want to exceed it!

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Michael Sokolov
I think we picked the 1024 number as something that seemed so large nobody would ever want to exceed it! Obviously that was naive. Still the limit serves as a cautionary point for users; if your vectors are bigger than this, there is probably a better way to accomplish what you are after (eg

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Julie Tibshirani
Sounds good, hope the testing goes well! Memory and CPU (largely from more expensive vector distance calculations) are indeed the main factors to consider. Julie On Mon, Feb 14, 2022 at 1:02 PM Michael Wechner wrote: > Hi Julie > > Thanks again for your feedback! > > I will do some more tests
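
For a rough sense of the memory side, a small back-of-the-envelope sketch (own numbers, not from the thread): raw float32 vector storage only, ignoring the additional HNSW graph links.

    public class VectorMemoryEstimate {
        public static void main(String[] args) {
            long numDocs = 1_000_000L;
            // all-mpnet-base-v2, the current Lucene maximum, and OpenAI davinci embeddings
            int[] dims = {768, 1024, 12288};
            for (int dim : dims) {
                long bytesPerVector = 4L * dim;             // 4 bytes per float32 component
                long totalBytes = bytesPerVector * numDocs; // raw vectors only, graph links come on top
                System.out.printf("%5d dims: %6d bytes/vector, ~%5.1f GB for %,d docs%n",
                        dim, bytesPerVector, totalBytes / 1e9, numDocs);
            }
        }
    }

At 12288 dimensions this works out to roughly 49 KB per vector, i.e. around 49 GB of raw vector data per million documents, which is why the distance calculations and memory footprint dominate.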

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Michael Wechner
Hi Julie Thanks again for your feedback! I will do some more tests with "all-mpnet-base-v2" (768 dimensions) and "all-roberta-large-v1" (1024), so 1024 is enough for me for the moment :-) But yes, I could imagine that eventually it might make sense to allow more dimensions than 1024. Besides memory

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Julie Tibshirani
Hello Michael, the max number of dimensions is currently hardcoded and can't be changed. I could see an argument for increasing the default a bit and would be happy to discuss if you'd like to file a JIRA issue. However, 12288 dimensions still seems high to me; this is much larger than most
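
To make the hardcoded limit concrete, a small sketch assuming the Lucene 9.x behaviour, where KnnVectorField rejects vectors above the built-in maximum of 1024 dimensions with an IllegalArgumentException at construction time:

    import org.apache.lucene.document.KnnVectorField;
    import org.apache.lucene.index.VectorSimilarityFunction;

    public class DimensionLimitDemo {
        public static void main(String[] args) {
            // 12288 dimensions, the size of OpenAI text-similarity-davinci-001 embeddings
            float[] tooLarge = new float[12288];
            tooLarge[0] = 1.0f;
            try {
                new KnnVectorField("vector", tooLarge, VectorSimilarityFunction.DOT_PRODUCT);
            } catch (IllegalArgumentException e) {
                // Expected with Lucene 9.x, which caps vectors at 1024 dimensions
                System.out.println("Rejected: " + e.getMessage());
            }
        }
    }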

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Michael Wechner
Hi Julie Thanks very much for this link, which is very interesting! Btw, do you have an idea how to increase the default max size of 1024? https://lists.apache.org/thread/hyb6w5c4x5rjt34k3w7zqn3yp5wvf33o Thanks Michael On 14.02.22 at 17:45, Julie Tibshirani wrote: Hello Michael, I don't

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-14 Thread Julie Tibshirani
Hello Michael, I don't have personal experience with these models, but I found this article insightful: https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9. It evaluates the OpenAI models against a variety of existing

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-13 Thread Michael Wechner
Re the OpenAI embeddings, the following recent paper might be of interest: https://arxiv.org/pdf/2201.10005.pdf (Text and Code Embeddings by Contrastive Pre-Training, Jan 24, 2022) Thanks Michael On 13.02.22 at 00:14, Michael Wechner wrote: Here is a concrete example where I combine the OpenAI model

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-12 Thread Michael Wechner
Here is a concrete example where I combine the OpenAI model "text-similarity-ada-001" with Lucene vector search. INPUT sentence: "What is your age this year?" Result sentences: 1) How old are you this year? (score 0.98860765) 2) What was your age last year? (score 0.97811764) 3) What is your
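
A minimal sketch of the Lucene side of such a query (Lucene 9.x). embedQuery() is a hypothetical placeholder for the call that fetches the text-similarity-ada-001 embedding (1024 dimensions), and the index is assumed to contain a stored "text" field plus a "vector" KnnVectorField, as in the indexing sketch further up.

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.KnnVectorQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class SimilaritySearch {

        public static void main(String[] args) throws Exception {
            try (FSDirectory dir = FSDirectory.open(Paths.get("vector-index"));
                 DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                float[] queryVector = embedQuery("What is your age this year?");
                // Approximate nearest-neighbour search over the "vector" field, top 3 hits
                TopDocs hits = searcher.search(new KnnVectorQuery("vector", queryVector, 3), 3);
                for (ScoreDoc hit : hits.scoreDocs) {
                    String text = searcher.doc(hit.doc).get("text");
                    System.out.println(text + "  score " + hit.score);
                }
            }
        }

        // Hypothetical helper; in practice this would call the OpenAI embeddings endpoint.
        static float[] embedQuery(String text) {
            float[] v = new float[1024]; // text-similarity-ada-001 vectors are 1024-dimensional
            v[0] = 1.0f;
            return v;
        }
    }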

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-12 Thread Michael Wechner
Hi Alessandro I am mainly interested in detecting similarity, for example whether the following two sentences are similar, i.e. likely to mean the same thing: "How old are you?" "What is your age?" and that the following two sentences are not similar, i.e. do not mean the same thing: "How

Re: Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-12 Thread Alessandro Benedetti
Hi Michael, experience to what extent? We have been exploring the area for a while given we contributed the first neural search milestone to Apache Solr. What is your curiosity? Performance? Relevance impact? How to integrate it? Regards On Fri, 11 Feb 2022, 22:38 Michael Wechner, wrote: > Hi >

Experience re OpenAI embeddings in combination with Lucene vector search

2022-02-11 Thread Michael Wechner
Hi Does anyone have experience using OpenAI embeddings in combination with Lucene vector search? https://beta.openai.com/docs/guides/embeddings for example comparing performance re vector size https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings and
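
A minimal sketch of calling that embeddings endpoint from Java, assuming the engine-style URL above, the request body format described in the linked docs, and an API key in the OPENAI_API_KEY environment variable. The JSON is handled crudely; a real client would parse data[0].embedding into a float[] and hand it to Lucene's KnnVectorField / KnnVectorQuery.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class OpenAiEmbeddingRequest {

        public static void main(String[] args) throws Exception {
            String apiKey = System.getenv("OPENAI_API_KEY");
            String body = "{\"input\": \"What is your age this year?\"}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://api.openai.com/v1/engines/text-similarity-ada-001/embeddings"))
                    .header("Authorization", "Bearer " + apiKey)
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The response JSON is expected to contain data[0].embedding,
            // a 1024-element float array for text-similarity-ada-001.
            System.out.println(response.body());
        }
    }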