+1 to raising the limit. Maybe in the future, performance problems can be mitigated with optimisations or hardware acceleration (GPUs), etc.
On Sat, 1 Apr, 2023, 6:18 pm Michael Sokolov, <msoko...@gmail.com> wrote:

> I'm also in favor of raising this limit. We do see some datasets with higher than 1024 dims. I also think we need to keep a limit. For example, we currently need to keep all the vectors in RAM while indexing, and we want to be able to support reasonable numbers of vectors in an index segment. Also, we don't know what innovations might come down the road. Maybe someday we want to do product quantization and enforce that (k, m) both fit in a byte -- we wouldn't be able to do that if a vector's dimension were to exceed 32K.
>
> On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
>
>> I am also curious what the worst-case scenario would be if we removed the constant altogether (so the limit automatically becomes Java's Integer.MAX_VALUE), i.e. right now if you exceed the limit you get:
>>
>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>>   throw new IllegalArgumentException(
>>>       "cannot index vectors with dimension greater than "
>>>           + ByteVectorValues.MAX_DIMENSIONS);
>>> }
>>
>> in relation to:
>>
>>> These limits allow us to better tune our data structures, prevent overflows, help ensure we have good test coverage, etc.
>>
>> I agree 100%, especially about typing things properly and avoiding resource waste here and there, but I am not entirely sure this is the case for the current implementation, i.e. do we have optimizations in place that assume the max dimension to be 1024? If I missed that (and I likely have), I of course suggest that the contribution should not just blindly remove the limit, but do it appropriately. I am not in favor of just doubling it as suggested by some people; I would ideally prefer a solution that holds up for a decent amount of time, rather than having to modify it every time someone requires a higher limit.
>>
>> Cheers
>>
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benede...@sease.io
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>
>>
>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wech...@wyona.com> wrote:
>>
>>> OpenAI reduced their embedding size to 1536 dimensions
>>>
>>> https://openai.com/blog/new-and-improved-embedding-model
>>>
>>> so 2048 would work :-)
>>>
>>> but other services also provide higher dimensions, sometimes with slightly better accuracy
>>>
>>> Thanks
>>>
>>> Michael
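>>>
>>> To make this concrete, here is a minimal sketch of what currently happens when you try to index one of those 1536-dimensional embeddings. I am writing this from memory against the current 9.x API, so treat the exact class names (KnnFloatVectorField etc.) as illustrative rather than definitive:
>>>
>>>   import org.apache.lucene.document.Document;
>>>   import org.apache.lucene.document.KnnFloatVectorField;
>>>   import org.apache.lucene.index.IndexWriter;
>>>   import org.apache.lucene.index.IndexWriterConfig;
>>>   import org.apache.lucene.index.VectorSimilarityFunction;
>>>   import org.apache.lucene.store.ByteBuffersDirectory;
>>>
>>>   public class VectorDimensionLimitDemo {
>>>     public static void main(String[] args) throws Exception {
>>>       // e.g. an OpenAI text-embedding-ada-002 vector: 1536 dims
>>>       float[] embedding = new float[1536];
>>>       try (var dir = new ByteBuffersDirectory();
>>>           var writer = new IndexWriter(dir, new IndexWriterConfig())) {
>>>         Document doc = new Document();
>>>         // The field constructor validates the dimension against the
>>>         // hard-coded maximum (1024), so this line throws
>>>         // IllegalArgumentException before the document ever reaches
>>>         // the writer.
>>>         doc.add(new KnnFloatVectorField("embedding", embedding,
>>>             VectorSimilarityFunction.COSINE));
>>>         writer.addDocument(doc);
>>>       }
>>>     }
>>>   }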
>>> Am 31.03.23 um 14:45 schrieb Adrien Grand:
>>> > I'm supportive of bumping the limit on the maximum dimension for vectors to something that is above what the majority of users need, but I'd like to keep a limit. We have limits for other things like the max number of docs per index, the max term length, the max number of dimensions of points, etc., and there are a few things that we don't have limits on that I wish we had limits on. These limits allow us to better tune our data structures, prevent overflows, help ensure we have good test coverage, etc.
>>> >
>>> > That said, these other limits we have in place are quite high. E.g. the 32kB term limit: nobody would ever type a 32kB term in a text box. Likewise for the max of 8 dimensions for points: a segment cannot possibly have 2 splits per dimension on average if it doesn't have 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions than 8 would likely defeat the point of indexing. In contrast, our limit on the number of dimensions of vectors seems to be under what some users would like, and while I understand the performance argument against bumping the limit, it doesn't feel to me like something that would be so bad that we need to prevent users from using numbers of dimensions in the low thousands; e.g. top-k KNN searches would still look at a very small subset of the full dataset.
>>> >
>>> > So overall, my vote would be to bump the limit to 2048 as suggested by Mayya on the issue that you linked.
>>> >
>>> > On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>>> >> Thanks Alessandro for summarizing the discussion below!
>>> >>
>>> >> I understand that there is no clear reasoning about what the best embedding size is, but I think heuristic approaches like the one described at the following link can be helpful:
>>> >>
>>> >> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>> >>
>>> >> Having said this, we see various embedding services providing higher dimensions than 1024, for example OpenAI, Cohere and Aleph Alpha.
>>> >>
>>> >> And it would be great if we could run benchmarks without having to recompile Lucene ourselves.
>>> >>
>>> >> Therefore I would suggest either increasing the limit or, even better, removing the limit and adding a disclaimer that people should be aware of possible crashes etc.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael
>>> >>
>>> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
>>> >>
>>> >> I've been monitoring various discussions on Pull Requests about changing the max number of dimensions allowed for Lucene HNSW vectors:
>>> >>
>>> >> https://github.com/apache/lucene/pull/12191
>>> >>
>>> >> https://github.com/apache/lucene/issues/11507
>>> >>
>>> >> I would like to set up a discussion and potentially a vote about this. I have seen some strong opposition from a few people, but a majority in favor of this direction.
>>> >>
>>> >> Motivation
>>> >>
>>> >> We were discussing in the Solr Slack channel with Ishan Chattopadhyaya, Marcus Eagan, and David Smiley about some neural search integrations in Solr: https://github.com/openai/chatgpt-retrieval-plugin
>>> >>
>>> >> Proposal
>>> >>
>>> >> No hard limit at all. As in many other Lucene areas, users will be allowed to push the system to the limit of their resources and get terrible performance or crashes if they want (a rough sketch of what this could look like follows below).
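>>> >>
>>> >> To be concrete, here is a purely illustrative sketch of what the validation could reduce to if the constant were removed; the method name and message are invented for the example, this is not a patch:
>>> >>
>>> >>   // Hypothetical: drop the hard-coded MAX_DIMENSIONS constant and keep
>>> >>   // only the sanity check that primitive types require anyway.
>>> >>   static void validateVectorDimension(int dimension) {
>>> >>     if (dimension <= 0) {
>>> >>       throw new IllegalArgumentException(
>>> >>           "vector dimension must be > 0, got " + dimension);
>>> >>     }
>>> >>     // No arbitrary upper bound: the effective ceiling becomes
>>> >>     // Integer.MAX_VALUE, and the practical one becomes whatever
>>> >>     // RAM and CPU the user is willing to spend.
>>> >>   }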
>>> >> What we are NOT discussing:
>>> >>
>>> >> - Quality and scalability of the HNSW algorithm
>>> >> - dimensionality reduction
>>> >> - strategies to fit in an arbitrary self-imposed limit
>>> >>
>>> >> Benefits
>>> >>
>>> >> - users can use the models they want to generate vectors
>>> >> - removal of an arbitrary limit that blocks some integrations
>>> >>
>>> >> Cons
>>> >>
>>> >> - if you go for vectors with high dimensions, there's no guarantee you get acceptable performance for your use case
>>> >>
>>> >> I want to keep it simple: right now, in many Lucene areas, you can push the system to unacceptable performance or crashes. For example, we don't limit the number of docs per index to an arbitrary maximum of N; you push in as many docs as you like, and if they are too many for your system, you get terrible performance, crashes, or whatever.
>>> >>
>>> >> Limits caused by primitive Java types will stay there behind the scenes, and that's acceptable, but I would prefer not to have arbitrary hard-coded ones that may limit the software's usability and integration, which are extremely important for a library.
>>> >>
>>> >> I strongly encourage people to add benefits and cons that I missed (I am sure I missed some, but I wanted to keep it simple).
>>> >>
>>> >> Cheers
>>> >>
>>> >> --------------------------
>>> >> Alessandro Benedetti
>>> >> Director @ Sease Ltd.
>>> >> Apache Lucene/Solr Committer
>>> >> Apache Solr PMC Member
>>> >>
>>> >> e-mail: a.benede...@sease.io
>>> >>
>>> >> Sease - Information Retrieval Applied
>>> >> Consulting | Training | Open Source
>>> >>
>>> >> Website: Sease.io
>>> >> LinkedIn | Twitter | Youtube | Github
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org