I am also curious what the worst-case scenario would be if we removed the constant altogether (so the limit automatically becomes the Java Integer.MAX_VALUE), i.e. right now if you exceed the limit you get:
> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>   throw new IllegalArgumentException(
>       "cannot index vectors with dimension greater than "
>           + ByteVectorValues.MAX_DIMENSIONS);
> }

in relation to:

> These limits allow us to
> better tune our data structures, prevent overflows, help ensure we
> have good test coverage, etc.

I agree 100%, especially about typing things properly and avoiding resource waste here and there, but I am not entirely sure this is the case for the current implementation, i.e. do we have optimizations in place that assume the max dimension to be 1024? If I missed that (and I likely have), then of course the contribution should not just blindly remove the limit, but do it appropriately.

I am not in favor of just doubling it as suggested by some people; I would ideally prefer a solution that remains valid to a decent extent, rather than having to modify it any time someone requires a higher limit.

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*
e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>


On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wech...@wyona.com> wrote:

> OpenAI reduced their size to 1536 dimensions
>
> https://openai.com/blog/new-and-improved-embedding-model
>
> so 2048 would work :-)
>
> but other services do also provide higher dimensions with sometimes
> slightly better accuracy
>
> Thanks
>
> Michael
>
>
> Am 31.03.23 um 14:45 schrieb Adrien Grand:
> > I'm supportive of bumping the limit on the maximum dimension for
> > vectors to something that is above what the majority of users need,
> > but I'd like to keep a limit. We have limits for other things like the
> > max number of docs per index, the max term length, the max number of
> > dimensions of points, etc. and there are a few things that we don't
> > have limits on that I wish we had limits on. These limits allow us to
> > better tune our data structures, prevent overflows, help ensure we
> > have good test coverage, etc.
> >
> > That said, these other limits we have in place are quite high. E.g.
> > the 32kB term limit: nobody would ever type a 32kB term in a text box.
> > Likewise for the max of 8 dimensions for points: a segment cannot
> > possibly have 2 splits per dimension on average if it doesn't have
> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
> > than 8 would likely defeat the point of indexing. In contrast, our
> > limit on the number of dimensions of vectors seems to be under what
> > some users would like, and while I understand the performance argument
> > against bumping the limit, it doesn't feel to me like something that
> > would be so bad that we need to prevent users from using numbers of
> > dimensions in the low thousands, e.g. top-k KNN searches would still
> > look at a very small subset of the full dataset.
> >
> > So overall, my vote would be to bump the limit to 2048 as suggested by
> > Mayya on the issue that you linked.
> >
> > On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
> > <michael.wech...@wyona.com> wrote:
> >> Thanks Alessandro for summarizing the discussion below!
> >>
> >> I understand that there is no clear reasoning re what is the best
> >> embedding size, whereas I think heuristic approaches like described by the
> >> following link can be helpful
> >>
> >> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
> >>
> >> Having said this, we see various embedding services providing higher
> >> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
> >>
> >> And it would be great if we could run benchmarks without having to
> >> recompile Lucene ourselves.
> >>
> >> Therefore I would suggest to either increase the limit or, even
> >> better, to remove the limit and add a disclaimer that people should be
> >> aware of possible crashes etc.
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
> >>
> >> I've been monitoring various discussions on Pull Requests about
> >> changing the max number of dimensions allowed for Lucene HNSW vectors:
> >>
> >> https://github.com/apache/lucene/pull/12191
> >>
> >> https://github.com/apache/lucene/issues/11507
> >>
> >> I would like to set up a discussion and potentially a vote about this.
> >>
> >> I have seen some strong opposition from a few people but a majority in
> >> favor of this direction.
> >>
> >> Motivation
> >>
> >> We were discussing in the Solr slack channel with Ishan Chattopadhyaya,
> >> Marcus Eagan, and David Smiley about some neural search integrations in
> >> Solr: https://github.com/openai/chatgpt-retrieval-plugin
> >>
> >> Proposal
> >>
> >> No hard limit at all.
> >>
> >> As for many other Lucene areas, users will be allowed to push the
> >> system to the limit of their resources and get terrible performance or
> >> crashes if they want.
> >>
> >> What we are NOT discussing
> >>
> >> - Quality and scalability of the HNSW algorithm
> >>
> >> - dimensionality reduction
> >>
> >> - strategies to fit in an arbitrary self-imposed limit
> >>
> >> Benefits
> >>
> >> - users can use the models they want to generate vectors
> >>
> >> - removal of an arbitrary limit that blocks some integrations
> >>
> >> Cons
> >>
> >> - if you go for vectors with high dimensions, there's no guarantee
> >> you get acceptable performance for your use case
> >>
> >> I want to keep it simple: right now, in many Lucene areas, you can push
> >> the system to unacceptable performance / crashes.
> >>
> >> For example, we don't limit the number of docs per index to an
> >> arbitrary maximum of N; you push as many docs as you like, and if they are
> >> too many for your system, you get terrible performance/crashes/whatever.
> >>
> >> Limits caused by primitive Java types will stay there behind the scenes,
> >> and that's acceptable, but I would prefer not to have arbitrary hard-coded
> >> ones that may limit the software usability and integration, which is
> >> extremely important for a library.
> >>
> >> I strongly encourage people to add benefits and cons that I missed (I
> >> am sure I missed some of them, but wanted to keep it simple).
> >>
> >> Cheers
> >>
> >> --------------------------
> >> Alessandro Benedetti
> >> Director @ Sease Ltd.
> >> Apache Lucene/Solr Committer
> >> Apache Solr PMC Member
> >>
> >> e-mail: a.benede...@sease.io
> >>
> >> Sease - Information Retrieval Applied
> >> Consulting | Training | Open Source
> >>
> >> Website: Sease.io
> >> LinkedIn | Twitter | Youtube | Github
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
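
For context on the check quoted at the top of the thread, here is a minimal sketch of where the limit surfaces to an indexing application. It assumes a Lucene 9.x-era API (KnnFloatVectorField, IndexWriter, ByteBuffersDirectory) and the 1024-dimension default discussed above; the exact class holding MAX_DIMENSIONS and the exception message may differ between releases.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.ByteBuffersDirectory;

    public class VectorDimensionLimitDemo {
      public static void main(String[] args) throws Exception {
        // 1536 dimensions, e.g. an OpenAI embedding as mentioned in the
        // thread: above the 1024 default limit under discussion.
        float[] embedding = new float[1536];

        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
          Document doc = new Document();
          // With the hard-coded limit in place, building the vector field is
          // expected to fail with the IllegalArgumentException quoted above
          // ("cannot index vectors with dimension greater than 1024").
          doc.add(new KnnFloatVectorField("embedding", embedding,
              VectorSimilarityFunction.COSINE));
          writer.addDocument(doc);
        } catch (IllegalArgumentException e) {
          System.err.println("Rejected by the dimension limit: " + e.getMessage());
        }
      }
    }

Bumping the constant to 2048, as proposed by Mayya, would let this particular 1536-dimension example through, while removing the constant entirely would defer any failure to whatever the underlying data structures and available resources can handle.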