+1 to raising the limit. Maybe in the future, performance problems can be mitigated with optimisations or hardware acceleration (GPUs), etc.
On Sat, 1 Apr, 2023, 6:18 pm Michael Sokolov, <msoko...@gmail.com> wrote:

> I'm also in favor of raising this limit. We do see some datasets with higher than 1024 dims. I also think we need to keep a limit. For example, we currently need to keep all the vectors in RAM while indexing, and we want to be able to support reasonable numbers of vectors in an index segment. Also, we don't know what innovations might come down the road. Maybe someday we want to do product quantization and enforce that (k, m) both fit in a byte -- we wouldn't be able to do that if a vector's dimension were to exceed 32K.
>
> On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
>
>> I am also curious what the worst-case scenario would be if we removed the constant altogether (so the limit automatically becomes Java's Integer.MAX_VALUE), i.e. right now if you exceed the limit you get:
>>
>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>>   throw new IllegalArgumentException(
>>>       "cannot index vectors with dimension greater than "
>>>           + ByteVectorValues.MAX_DIMENSIONS);
>>> }
>>
>> in relation to:
>>
>>> These limits allow us to better tune our data structures, prevent overflows, help ensure we have good test coverage, etc.
>>
>> I agree 100%, especially about typing things properly and avoiding resource waste here and there, but I am not entirely sure this is the case for the current implementation, i.e. do we have optimizations in place that assume the max dimension to be 1024? If I missed that (and I likely have), I of course suggest that the contribution should not just blindly remove the limit, but do it appropriately. I am not in favor of just doubling it as suggested by some people; I would ideally prefer a solution that holds up for a decent amount of time, rather than having to modify it every time someone requires a higher limit.
>>
>> Cheers
>>
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benede...@sease.io
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>
>>
>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wech...@wyona.com> wrote:
>>
>>> OpenAI reduced their embedding size to 1536 dimensions
>>>
>>> https://openai.com/blog/new-and-improved-embedding-model
>>>
>>> so 2048 would work :-)
>>>
>>> but other services also provide higher dimensions, sometimes with slightly better accuracy
>>>
>>> Thanks
>>>
>>> Michael
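>>>
>>> To make this concrete, here is a minimal sketch of what currently happens when you try to index one of those 1536-dimensional embeddings. I am writing this from memory against the current 9.x API, so treat the exact class names (KnnFloatVectorField etc.) as illustrative rather than definitive:
>>>
>>>   import org.apache.lucene.document.Document;
>>>   import org.apache.lucene.document.KnnFloatVectorField;
>>>   import org.apache.lucene.index.IndexWriter;
>>>   import org.apache.lucene.index.IndexWriterConfig;
>>>   import org.apache.lucene.index.VectorSimilarityFunction;
>>>   import org.apache.lucene.store.ByteBuffersDirectory;
>>>
>>>   public class VectorDimensionLimitDemo {
>>>     public static void main(String[] args) throws Exception {
>>>       // e.g. an OpenAI text-embedding-ada-002 vector: 1536 dims
>>>       float[] embedding = new float[1536];
>>>       try (var dir = new ByteBuffersDirectory();
>>>           var writer = new IndexWriter(dir, new IndexWriterConfig())) {
>>>         Document doc = new Document();
>>>         // The field constructor validates the dimension against the
>>>         // hard-coded maximum (1024), so this line throws
>>>         // IllegalArgumentException before the document ever reaches
>>>         // the writer.
>>>         doc.add(new KnnFloatVectorField("embedding", embedding,
>>>             VectorSimilarityFunction.COSINE));
>>>         writer.addDocument(doc);
>>>       }
>>>     }
>>>   }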
>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>> > I'm supportive of bumping the limit on the maximum dimension for vectors to something that is above what the majority of users need, but I'd like to keep a limit. We have limits for other things like the max number of docs per index, the max term length, the max number of dimensions of points, etc., and there are a few things that we don't have limits on that I wish we had limits on. These limits allow us to better tune our data structures, prevent overflows, help ensure we have good test coverage, etc.
>>> >
>>> > That said, these other limits we have in place are quite high. E.g. the 32kB term limit: nobody would ever type a 32kB term in a text box. Likewise for the max of 8 dimensions for points: a segment cannot possibly have 2 splits per dimension on average if it doesn't have 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions than 8 would likely defeat the point of indexing. In contrast, our limit on the number of dimensions of vectors seems to be under what some users would like, and while I understand the performance argument against bumping the limit, it doesn't feel to me like something that would be so bad that we need to prevent users from using numbers of dimensions in the low thousands; e.g. top-k KNN searches would still look at a very small subset of the full dataset.
>>> >
>>> > So overall, my vote would be to bump the limit to 2048 as suggested by Mayya on the issue that you linked.
>>> >
>>> > On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>>> >> Thanks Alessandro for summarizing the discussion below!
>>> >>
>>> >> I understand that there is no clear reasoning about what the best embedding size is, but I think heuristic approaches like the one described at the following link can be helpful:
>>> >>
>>> >> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>> >>
>>> >> Having said this, we see various embedding services providing higher dimensions than 1024, for example OpenAI, Cohere and Aleph Alpha.
>>> >>
>>> >> And it would be great if we could run benchmarks without having to recompile Lucene ourselves.
>>> >>
>>> >> Therefore I would suggest either increasing the limit or, even better, removing the limit and adding a disclaimer that people should be aware of possible crashes etc.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>> >>
>>> >> I've been monitoring various discussions on Pull Requests about changing the max number of dimensions allowed for Lucene HNSW vectors:
>>> >>
>>> >> https://github.com/apache/lucene/pull/12191
>>> >>
>>> >> https://github.com/apache/lucene/issues/11507
>>> >>
>>> >> I would like to set up a discussion and potentially a vote about this. I have seen some strong opposition from a few people, but a majority in favor of this direction.
>>> >>
>>> >> Motivation
>>> >>
>>> >> We were discussing in the Solr Slack channel with Ishan Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>>> >>
>>> >> Proposal
>>> >>
>>> >> No hard limit at all. As in many other Lucene areas, users will be allowed to push the system to the limit of their resources and get terrible performance or crashes if they want (a rough sketch of what this could look like follows below).
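>>> >>
>>> >> To be concrete, here is a purely illustrative sketch of what the validation could reduce to if the constant were removed; the method name and message are invented for the example, this is not a patch:
>>> >>
>>> >>   // Hypothetical: drop the hard-coded MAX_DIMENSIONS constant and keep
>>> >>   // only the sanity check that primitive types require anyway.
>>> >>   static void validateVectorDimension(int dimension) {
>>> >>     if (dimension <= 0) {
>>> >>       throw new IllegalArgumentException(
>>> >>           "vector dimension must be > 0, got " + dimension);
>>> >>     }
>>> >>     // No arbitrary upper bound: the effective ceiling becomes
>>> >>     // Integer.MAX_VALUE, and the practical one becomes whatever
>>> >>     // RAM and CPU the user is willing to spend.
>>> >>   }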
>>> >> What we are NOT discussing:
>>> >>
>>> >> - Quality and scalability of the HNSW algorithm
>>> >> - dimensionality reduction
>>> >> - strategies to fit in an arbitrary self-imposed limit
>>> >>
>>> >> Benefits
>>> >>
>>> >> - users can use the models they want to generate vectors
>>> >> - removal of an arbitrary limit that blocks some integrations
>>> >>
>>> >> Cons
>>> >>
>>> >> - if you go for vectors with high dimensions, there's no guarantee you get acceptable performance for your use case
>>> >>
>>> >> I want to keep it simple: right now, in many Lucene areas, you can push the system to unacceptable performance or crashes. For example, we don't limit the number of docs per index to an arbitrary maximum of N; you push in as many docs as you like, and if they are too many for your system, you get terrible performance, crashes, or whatever.
>>> >>
>>> >> Limits caused by primitive Java types will stay there behind the scenes, and that's acceptable, but I would prefer not to have arbitrary hard-coded ones that may limit the software's usability and integration, which are extremely important for a library.
>>> >>
>>> >> I strongly encourage people to add benefits and cons that I missed (I am sure I missed some, but I wanted to keep it simple).
>>> >>
>>> >> Cheers
>>> >>
>>> >> --------------------------
>>> >> Alessandro Benedetti
>>> >> Director @ Sease Ltd.
>>> >> Apache Lucene/Solr Committer
>>> >> Apache Solr PMC Member
>>> >>
>>> >> e-mail: a.benede...@sease.io
>>> >>
>>> >> Sease - Information Retrieval Applied
>>> >> Consulting | Training | Open Source
>>> >>
>>> >> Website: Sease.io
>>> >> LinkedIn | Twitter | Youtube | Github
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org