Thanks Alessandro for summarizing the discussion below!
I understand that there is no clear reasoning about what the best
embedding size is, but I think heuristic approaches like the ones
described in the following link can be helpful:
https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
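Just to illustrate what such a heuristic can look like, here is a
minimal Java sketch of one commonly quoted rule of thumb (take the
fourth root of the vocabulary / category count); I am not claiming this
is the exact formula from the thread above, it is only an example:

    // Illustrative rule of thumb only: embedding size ~ vocabularySize^(1/4).
    public class EmbeddingSizeHeuristic {
        static int suggestEmbeddingSize(long vocabularySize) {
            return (int) Math.ceil(Math.pow(vocabularySize, 0.25));
        }

        public static void main(String[] args) {
            System.out.println(suggestEmbeddingSize(50_000L));    // ~15 dimensions
            System.out.println(suggestEmbeddingSize(1_000_000L)); // ~32 dimensions
        }
    }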
Having said this, we see various embedding services, for example
OpenAI, Cohere and Aleph Alpha, providing more than 1024 dimensions,
and it would be great if we could run benchmarks with such embeddings
without having to recompile Lucene ourselves.
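To make the benchmark point concrete, here is a minimal sketch of what
we would like to index, assuming a 1536-dimensional embedding from a
hosted service (the field name and dimension are just examples); with
the hard-coded 1024-dimension cap this is expected to fail with an
IllegalArgumentException, depending on the Lucene version:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class HighDimVectorIndexing {
        public static void main(String[] args) throws Exception {
            int dims = 1536;                  // assumed size of a hosted embedding model
            float[] vector = new float[dims]; // dummy values; a real embedding would go here

            try (Directory dir = new ByteBuffersDirectory();
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
                Document doc = new Document();
                // With the current 1024-dimension cap this line is expected to throw;
                // lifting the limit would let such benchmarks run on a stock Lucene build.
                doc.add(new KnnFloatVectorField("embedding", vector,
                        VectorSimilarityFunction.EUCLIDEAN));
                writer.addDocument(doc);
            }
        }
    }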
Therefore I would suggest either increasing the limit or, even better,
removing the limit and adding a disclaimer that people should be aware
of possible crashes etc.
Thanks
Michael
On 31.03.23 at 11:43, Alessandro Benedetti wrote:
I've been monitoring various discussions on Pull Requests about
changing the max number of dimensions allowed for Lucene HNSW vectors:
https://github.com/apache/lucene/pull/12191
https://github.com/apache/lucene/issues/11507
I would like to set up a discussion and potentially a vote about this.
I have seen some strong opposition from a few people, but a majority
in favor of this direction.
*Motivation*
We were discussing some neural search integrations in Solr in the Solr
Slack channel with Ishan Chattopadhyaya, Marcus Eagan, and David
Smiley:
https://github.com/openai/chatgpt-retrieval-plugin
*Proposal*
No hard limit at all.
As for many other Lucene areas, users will be allowed to push the
system to the limit of their resources and get terrible performance
or crashes if they want.
*What we are NOT discussing*
- Quality and scalability of the HNSW algorithm
- Dimensionality reduction
- Strategies to fit within an arbitrary self-imposed limit
*Benefits*
- users can use the models they want to generate vectors
- removal of an arbitrary limit that blocks some integrations
*Cons*
- if you go for vectors with high dimensions, there's no guarantee
you get acceptable performance for your use case
I want to keep it simple: right now, in many Lucene areas, you can
already push the system to unacceptable performance or crashes.
For example, we don't limit the number of docs per index to an
arbitrary maximum of N; you index as many docs as you like, and if
they are too many for your system, you get terrible
performance/crashes/whatever.
Limits caused by primitive Java types will stay there behind the
scenes, and that's acceptable, but I would prefer not to have
arbitrary hard-coded ones that may limit the software's usability and
integration, which is extremely important for a library.
I strongly encourage people to add benefits and cons that I missed (I
am sure I missed some, but I wanted to keep it simple).
Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/
e-mail: a.benede...@sease.io
*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>