OpenAI reduced their embedding size to 1536 dimensions

https://openai.com/blog/new-and-improved-embedding-model

so 2048 would work :-)

but other services also provide higher dimensions, sometimes with slightly better accuracy
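
As a concrete illustration (my sketch, not from the original mail, assuming the Lucene 9.x 
KnnFloatVectorField API): indexing a 1536-dimensional OpenAI-style embedding is expected to 
fail under the current 1024-dimension limit, whereas a 2048 limit would accept it.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;

public class Ada002FieldSketch {
  public static void main(String[] args) {
    // Hypothetical 1536-dimensional embedding, e.g. from OpenAI's
    // text-embedding-ada-002; the values would come from the API in practice.
    float[] embedding = new float[1536];

    Document doc = new Document();
    // With the current 1024-dimension limit this constructor is expected to
    // throw an IllegalArgumentException; with a 2048 limit it would be accepted.
    doc.add(new KnnFloatVectorField("embedding", embedding));
    System.out.println("field created with " + embedding.length + " dimensions");
  }
}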

Thanks

Michael


On 31.03.23 at 14:45, Adrien Grand wrote:
I'm supportive of bumping the limit on the maximum dimension for
vectors to something that is above what the majority of users need,
but I'd like to keep a limit. We have limits for other things like the
max number of docs per index, the max term length, the max number of
dimensions of points, etc. and there are a few things that we don't
have limits on that I wish we had limits on. These limits allow us to
better tune our data structures, prevent overflows, help ensure we
have good test coverage, etc.

That said, these other limits we have in place are quite high. E.g.
the 32kB term limit: nobody would ever type a 32kB term in a text box.
Likewise for the max of 8 dimensions for points: a segment cannot
possibly have 2 splits per dimension on average if it doesn't have
512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
than 8 would likely defeat the point of indexing. In contrast, our
limit on the number of dimensions of vectors seems to be under what
some users would like, and while I understand the performance argument
against bumping the limit, it doesn't feel to me like something that
would be so bad that we need to prevent users from using numbers of
dimensions in the low thousands, e.g. top-k KNN searches would still
look at a very small subset of the full dataset.
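
(Spelled out, assuming the default BKD leaf size of 512 points:
512 * 2^(8*2) = 512 * 65,536 = 33,554,432, i.e. roughly 34M docs before a
segment can average two splits per dimension across 8 dimensions.)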

So overall, my vote would be to bump the limit to 2048 as suggested by
Mayya on the issue that you linked.

On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
<michael.wech...@wyona.com> wrote:
Thanks Alessandro for summarizing the discussion below!

I understand that there is no clear reasoning about what the best embedding size 
is, but I think heuristic approaches like the one described at the following link 
can be helpful:

https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
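
For what it's worth, one rule of thumb that comes up in those discussions is to size 
the embedding as roughly the fourth root of the vocabulary size; a tiny illustrative 
sketch (my example, not something that page or Lucene prescribes):

public class EmbeddingSizeHeuristic {
  // Illustrative rule of thumb only: embedding size ~ vocabularySize^(1/4).
  static int heuristicEmbeddingSize(long vocabularySize) {
    return (int) Math.ceil(Math.pow(vocabularySize, 0.25));
  }

  public static void main(String[] args) {
    // e.g. a 1,000,000-term vocabulary -> about 32 dimensions
    System.out.println(heuristicEmbeddingSize(1_000_000L));
  }
}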

Having said this, we see various embedding services providing higher dimensions 
than 1024, for example OpenAI, Cohere and Aleph Alpha.

And it would be great if we could run benchmarks without having to recompile 
Lucene ourselves.

Therefore I would suggest either increasing the limit or, even better, removing 
the limit and adding a disclaimer that people should be aware of possible 
crashes etc.
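
To make the benchmarking idea concrete, here is a rough sketch of the kind of 
index-and-search loop one might time (my sketch, assuming the Lucene 9.x APIs; with 
today's 1024-dimension limit the 1536-dimensional field below is expected to be 
rejected unless the limit is raised or Lucene is patched):

import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class HighDimVectorBenchmarkSketch {
  public static void main(String[] args) throws Exception {
    int dims = 1536;       // hypothetical size; only indexable once the limit allows it
    int numDocs = 10_000;
    Random random = new Random(42);

    try (Directory dir = new ByteBuffersDirectory();
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      // Index synthetic random vectors; a real benchmark would use model output.
      for (int i = 0; i < numDocs; i++) {
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("vec", randomVector(random, dims),
            VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
      }
      writer.commit();

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        long start = System.nanoTime();
        // Top-k KNN search over the HNSW graph.
        TopDocs topDocs = searcher.search(
            new KnnFloatVectorQuery("vec", randomVector(random, dims), 10), 10);
        System.out.printf("top-10 search took %.2f ms, returned %d hits%n",
            (System.nanoTime() - start) / 1e6, topDocs.scoreDocs.length);
      }
    }
  }

  private static float[] randomVector(Random random, int dims) {
    float[] v = new float[dims];
    for (int i = 0; i < dims; i++) {
      v[i] = random.nextFloat();
    }
    return v;
  }
}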

Thanks

Michael




On 31.03.23 at 11:43, Alessandro Benedetti wrote:


I've been monitoring various discussions on Pull Requests about changing the 
max number of dimensions allowed for Lucene HNSW vectors:

https://github.com/apache/lucene/pull/12191

https://github.com/apache/lucene/issues/11507


I would like to set up a discussion and potentially a vote about this.

I have seen some strong opposition from a few people, but a majority in favor of 
this direction.


Motivation

We were discussing some neural search integrations in Solr with Ishan 
Chattopadhyaya, Marcus Eagan, and David Smiley in the Solr Slack channel: 
https://github.com/openai/chatgpt-retrieval-plugin


Proposal

No hard limit at all.

As in many other Lucene areas, users will be allowed to push the system to the 
limit of their resources and get terrible performance or crashes if they want.


What we are NOT discussing

- Quality and scalability of the HNSW algorithm

- dimensionality reduction

- strategies to fit in an arbitrary self-imposed limit


Benefits

- users can use the models they want to generate vectors

- removal of an arbitrary limit that blocks some integrations


Cons

- if you go for vectors with high dimensions, there's no guarantee you get 
acceptable performance for your use case



I want to keep it simple: right now, in many Lucene areas, you can push the 
system to unacceptable performance or crashes.

For example, we don't limit the number of docs per index to an arbitrary 
maximum of N: you index as many docs as you like, and if they are too many for your 
system, you get terrible performance/crashes/whatever.


Limits caused by primitive Java types will stay there behind the scenes, and 
that's acceptable, but I would prefer not to have arbitrary hard-coded ones 
that may limit the software's usability and integration, which is extremely 
important for a library.
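
To illustrate the kind of intrinsic cost that remains even without a hard-coded cap 
(my back-of-the-envelope example, not a Lucene internal): each float dimension takes 
4 bytes, so the raw vector data alone grows linearly with the number of dimensions, 
before any HNSW graph overhead.

public class VectorFootprintSketch {
  public static void main(String[] args) {
    int dims = 1536;                                          // hypothetical dimension count
    long bytesPerVector = 4L * dims;                          // 6,144 bytes ~ 6 KiB per vector
    long bytesForTenMillion = bytesPerVector * 10_000_000L;   // ~ 61 GB of raw vector data
    System.out.println(bytesPerVector + " bytes/vector, "
        + bytesForTenMillion + " bytes for 10M vectors");
  }
}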


I strongly encourage people to add benefits and cons that I missed (I am sure I 
missed some of them, but I wanted to keep it simple).


Cheers

--------------------------
Alessandro Benedetti
Director @ Sease Ltd.
Apache Lucene/Solr Committer
Apache Solr PMC Member

e-mail: a.benede...@sease.io


Sease - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io
LinkedIn | Twitter | Youtube | Github




