I've been following various discussions on pull requests and issues about
changing the maximum number of dimensions allowed for Lucene HNSW vectors:

https://github.com/apache/lucene/pull/12191

https://github.com/apache/lucene/issues/11507


I would like to set up a discussion and potentially a vote about this.

I have seen strong opposition from a few people, but a majority in favor
of this direction.


*Motivation*

In the Solr Slack channel, Ishan Chattopadhyaya, Marcus Eagan, David Smiley,
and I were discussing some neural search integrations in Solr:
https://github.com/openai/chatgpt-retrieval-plugin


*Proposal*

No hard limit at all.

As in many other Lucene areas, users will be allowed to push the system to
the limit of their resources and get terrible performance or crashes if
they want.


*What we are NOT discussing*

- quality and scalability of the HNSW algorithm

- dimensionality reduction

- strategies to fit in an arbitrary self-imposed limit


*Benefits*

- users can use the models they want to generate vectors

- removal of an arbitrary limit that blocks some integrations


*Cons*

- if you use vectors with many dimensions, there's no guarantee you will get
acceptable performance for your use case



I want to keep it simple. Right now, in many Lucene areas, you can push the
system to unacceptable performance or crashes.

For example, we don't limit the number of docs per index to an arbitrary
maximum of N: you index as many docs as you like, and if they are too many
for your system, you get terrible performance/crashes/whatever.


Limits caused by primitive Java types will stay there behind the scenes, and
that's acceptable, but I would prefer not to have arbitrary hard-coded ones
that may limit the software's usability and integration, which are extremely
important for a library.
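
To make "limits caused by primitive Java types" concrete, here is a minimal
sketch. It is not Lucene code: the class and method names are hypothetical,
and it assumes one float vector is held in a single int-indexed buffer, which
may not match how Lucene actually stores vectors.

```java
// Hypothetical illustration only; none of these names exist in Lucene.
final class VectorLimitSketch {

    // Java arrays and ByteBuffers are indexed by int, so a float vector
    // kept in one such buffer cannot exceed this many dimensions.
    public static int maxDimsInOneBuffer() {
        return Integer.MAX_VALUE / Float.BYTES;
    }

    // Byte size of one float vector. Math.multiplyExact throws
    // ArithmeticException if the result does not fit in an int.
    public static int vectorByteSize(int dims) {
        if (dims <= 0) {
            throw new IllegalArgumentException("dims must be positive, got " + dims);
        }
        return Math.multiplyExact(dims, Float.BYTES);
    }

    public static void main(String[] args) {
        System.out.println("max dims in one buffer: " + maxDimsInOneBuffer());
        System.out.println("bytes for 2048 dims: " + vectorByteSize(2048));
    }
}
```

Under that assumption, the only ceiling comes from int arithmetic rather than
from a constant someone chose, which is the kind of limit I consider
acceptable.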


I strongly encourage people to add any benefits and cons that I missed (I am
sure I missed some of them, but I wanted to keep it simple).


Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>
