Thanks Mike for the insight! What would be the next steps, then? I see agreement, but also the need to identify a candidate MAX.
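To make a candidate concrete, here is a minimal sketch of what a raised but still overridable limit could look like; the class name, the system property, and the 2048 default are illustrative assumptions only, not current Lucene API:

    // Hypothetical sketch only -- not the current Lucene implementation.
    // It keeps a hard default (the 2048 candidate discussed in this thread)
    // while letting expert users override it without recompiling Lucene.
    public final class VectorDimensionLimit {

      // Illustrative candidate default from this thread.
      public static final int DEFAULT_MAX_DIMENSIONS = 2048;

      // Illustrative override hook, e.g. -Dlucene.hnsw.maxDimensions=4096.
      private static final int MAX_DIMENSIONS =
          Integer.getInteger("lucene.hnsw.maxDimensions", DEFAULT_MAX_DIMENSIONS);

      private VectorDimensionLimit() {}

      public static void check(int dimension) {
        if (dimension <= 0 || dimension > MAX_DIMENSIONS) {
          throw new IllegalArgumentException(
              "vector dimension must be in [1, " + MAX_DIMENSIONS + "], got: " + dimension);
        }
      }
    }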
Should we create a VOTE thread where we propose some values, each with a justification, and then vote? That way we can create a pull request and merge relatively soon.

Cheers

On Tue, 4 Apr 2023, 14:47 Michael Wechner, <michael.wech...@wyona.com> wrote:

> IIUC we all agree that the limit could be raised, but we need some solid
> reasoning about what limit makes sense, i.e. why we set this particular
> limit (e.g. 2048), right?
>
> Thanks
>
> Michael
>
> On 04.04.23 at 15:32, Michael McCandless wrote:
>
> > I am not in favor of just doubling it as suggested by some people; I
> > would ideally prefer a solution that remains valid to a decent extent,
> > rather than having to modify it any time someone requires a higher limit.
>
> The problem with this approach is that it is a one-way door, once
> released. We would not be able to lower the limit again in the future
> without possibly breaking some applications.
>
> > For example, we don't limit the number of docs per index to an
> > arbitrary maximum of N; you push as many docs as you like, and if they
> > are too many for your system, you get terrible
> > performance/crashes/whatever.
>
> Correction: we do check this limit and throw a specific exception now:
> https://github.com/apache/lucene/issues/6905
>
> +1 to raise the limit, but not remove it.
>
> Mike McCandless
> http://blog.mikemccandless.com
>
> On Mon, Apr 3, 2023 at 9:51 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
>
>> ... and what would be the next limit?
>> I guess we'll need to motivate it better than the 1024 one.
>> I appreciate the fact that a limit is pretty much wanted by everyone,
>> but I suspect we'll need some solid foundation for deciding the amount
>> (and it should be high enough to avoid continuous changes).
>>
>> Cheers
>>
>> On Sun, 2 Apr 2023, 07:29 Michael Wechner, <michael.wech...@wyona.com> wrote:
>>
>>> btw, what was the reasoning behind setting the current limit to 1024?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> On 01.04.23 at 14:47, Michael Sokolov wrote:
>>>
>>> I'm also in favor of raising this limit. We do see some datasets with
>>> more than 1024 dims. I also think we need to keep a limit. For example,
>>> we currently need to keep all the vectors in RAM while indexing, and we
>>> want to be able to support reasonable numbers of vectors in an index
>>> segment. Also, we don't know what innovations might come down the road.
>>> Maybe someday we will want to do product quantization and enforce that
>>> (k, m) both fit in a byte -- we wouldn't be able to do that if a
>>> vector's dimension were to exceed 32K.
>>>
>>> On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>>
>>>> I am also curious what the worst-case scenario would be if we removed
>>>> the constant altogether (so the limit automatically becomes Java's
>>>> Integer.MAX_VALUE), i.e. right now, if you exceed the limit you get:
>>>>
>>>>     if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>>>       throw new IllegalArgumentException(
>>>>           "cannot index vectors with dimension greater than "
>>>>               + ByteVectorValues.MAX_DIMENSIONS);
>>>>     }
>>>>
>>>> in relation to:
>>>>
>>>>> These limits allow us to better tune our data structures, prevent
>>>>> overflows, help ensure we have good test coverage, etc.
>>>>
>>>> I agree 100%, especially about typing things properly and avoiding
>>>> resource waste here and there, but I am not entirely sure this is the
>>>> case for the current implementation, i.e. do we have optimizations in
>>>> place that assume the max dimension to be 1024?
>>>> If I missed that (and I likely have), I of course suggest the
>>>> contribution should not just blindly remove the limit, but do it
>>>> appropriately.
>>>> I am not in favor of just doubling it as suggested by some people; I
>>>> would ideally prefer a solution that remains valid to a decent extent,
>>>> rather than having to modify it any time someone requires a higher
>>>> limit.
>>>>
>>>> Cheers
>>>>
>>>> --------------------------
>>>> Alessandro Benedetti
>>>> Director @ Sease Ltd.
>>>> Apache Lucene/Solr Committer
>>>> Apache Solr PMC Member
>>>>
>>>> e-mail: a.benede...@sease.io
>>>>
>>>> Sease - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>
>>>>
>>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wech...@wyona.com> wrote:
>>>>
>>>>> OpenAI reduced their size to 1536 dimensions
>>>>>
>>>>> https://openai.com/blog/new-and-improved-embedding-model
>>>>>
>>>>> so 2048 would work :-)
>>>>>
>>>>> but other services also provide higher dimensions, sometimes with
>>>>> slightly better accuracy.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>> On 31.03.23 at 14:45, Adrien Grand wrote:
>>>>> > I'm supportive of bumping the limit on the maximum dimension for
>>>>> > vectors to something that is above what the majority of users need,
>>>>> > but I'd like to keep a limit. We have limits for other things, like
>>>>> > the max number of docs per index, the max term length, the max
>>>>> > number of dimensions of points, etc., and there are a few things
>>>>> > that we don't have limits on that I wish we had limits on. These
>>>>> > limits allow us to better tune our data structures, prevent
>>>>> > overflows, help ensure we have good test coverage, etc.
>>>>> >
>>>>> > That said, these other limits we have in place are quite high. E.g.
>>>>> > the 32kB term limit: nobody would ever type a 32kB term into a text
>>>>> > box. Likewise for the max of 8 dimensions for points: a segment
>>>>> > cannot possibly have 2 splits per dimension on average if it
>>>>> > doesn't have 512*2^(8*2)=34M docs, a sizable dataset already, so
>>>>> > more than 8 dimensions would likely defeat the point of indexing.
>>>>> > In contrast, our limit on the number of dimensions of vectors seems
>>>>> > to be below what some users would like, and while I understand the
>>>>> > performance argument against bumping the limit, it doesn't feel to
>>>>> > me like something so bad that we need to prevent users from using
>>>>> > numbers of dimensions in the low thousands; e.g. top-k KNN searches
>>>>> > would still look at a very small subset of the full dataset.
>>>>> >
>>>>> > So overall, my vote would be to bump the limit to 2048 as suggested
>>>>> > by Mayya on the issue that you linked.
>>>>> >
>>>>> > On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
>>>>> > <michael.wech...@wyona.com> wrote:
>>>>> >> Thanks Alessandro for summarizing the discussion below!
>>>>> >>
>>>>> >> I understand that there is no clear reasoning about the best
>>>>> >> embedding size, whereas I think heuristic approaches like the one
>>>>> >> described at the following link can be helpful:
>>>>> >>
>>>>> >> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>>>> >>
>>>>> >> Having said this, we see various embedding services providing more
>>>>> >> than 1024 dimensions, for example OpenAI, Cohere and Aleph Alpha.
>>>>> >>
>>>>> >> And it would be great if we could run benchmarks without having to
>>>>> >> recompile Lucene ourselves.
>>>>> >>
>>>>> >> Therefore I would suggest to either increase the limit or, even
>>>>> >> better, to remove the limit and add a disclaimer that people
>>>>> >> should be aware of possible crashes etc.
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> Michael
>>>>> >>
>>>>> >> On 31.03.23 at 11:43, Alessandro Benedetti wrote:
>>>>> >>
>>>>> >> I've been monitoring various discussions on pull requests about
>>>>> >> changing the max number of dimensions allowed for Lucene HNSW
>>>>> >> vectors:
>>>>> >>
>>>>> >> https://github.com/apache/lucene/pull/12191
>>>>> >> https://github.com/apache/lucene/issues/11507
>>>>> >>
>>>>> >> I would like to set up a discussion and potentially a vote about
>>>>> >> this. I have seen some strong opposition from a few people, but a
>>>>> >> majority in favor of this direction.
>>>>> >>
>>>>> >> Motivation
>>>>> >> We were discussing some neural search integrations in Solr in the
>>>>> >> Solr Slack channel with Ishan Chattopadhyaya, Marcus Eagan, and
>>>>> >> David Smiley: https://github.com/openai/chatgpt-retrieval-plugin
>>>>> >>
>>>>> >> Proposal
>>>>> >> No hard limit at all. As in many other Lucene areas, users will be
>>>>> >> allowed to push the system to the limit of their resources and get
>>>>> >> terrible performance or crashes if they want.
>>>>> >>
>>>>> >> What we are NOT discussing
>>>>> >> - Quality and scalability of the HNSW algorithm
>>>>> >> - dimensionality reduction
>>>>> >> - strategies to fit in an arbitrary self-imposed limit
>>>>> >>
>>>>> >> Benefits
>>>>> >> - users can use the models they want to generate vectors
>>>>> >> - removal of an arbitrary limit that blocks some integrations
>>>>> >>
>>>>> >> Cons
>>>>> >> - if you go for vectors with high dimensions, there's no guarantee
>>>>> >> you get acceptable performance for your use case
>>>>> >>
>>>>> >> I want to keep it simple: right now, in many Lucene areas, you can
>>>>> >> push the system to unacceptable performance or crashes. For
>>>>> >> example, we don't limit the number of docs per index to an
>>>>> >> arbitrary maximum of N; you push as many docs as you like, and if
>>>>> >> they are too many for your system, you get terrible
>>>>> >> performance/crashes/whatever.
>>>>> >>
>>>>> >> Limits caused by primitive Java types will stay there behind the
>>>>> >> scenes, and that's acceptable, but I would prefer not to have
>>>>> >> arbitrary hard-coded ones that may limit the software's usability
>>>>> >> and integration, which is extremely important for a library.
>>>>> >>
>>>>> >> I strongly encourage people to add benefits and cons that I missed
>>>>> >> (I am sure I missed some, but I wanted to keep it simple).
>>>>> >>
>>>>> >> Cheers
>>>>> >>
>>>>> >> --------------------------
>>>>> >> Alessandro Benedetti
>>>>> >> Director @ Sease Ltd.
>>>>> >> Apache Lucene/Solr Committer
>>>>> >> Apache Solr PMC Member
>>>>> >>
>>>>> >> e-mail: a.benede...@sease.io
>>>>> >>
>>>>> >> Sease - Information Retrieval Applied
>>>>> >> Consulting | Training | Open Source
>>>>> >>
>>>>> >> Website: Sease.io
>>>>> >> LinkedIn | Twitter | Youtube | Github
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
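For reference, Adrien's 34M figure above can be reproduced with a tiny self-contained sketch; the 512 leaf size and the 2 splits per dimension are taken from his message, and the class name is purely illustrative:

    // Illustrative arithmetic check of the 512*2^(8*2) figure quoted above.
    public class PointsDimensionArithmetic {
      public static void main(String[] args) {
        // With 8-dimensional points and 2 splits per dimension on average,
        // a segment needs at least 512 * 2^(8*2) docs.
        long minDocs = 512L * (1L << (8 * 2));
        System.out.println(minDocs); // prints 33554432, i.e. ~34M
      }
    }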