Thanks Mike for the insight!

What would the next steps be, then?
I see agreement, but also the need to identify a candidate MAX.

Should we create a VOTE thread, where we propose some values with a
justification and then vote?

That way we can create a pull request and merge it relatively soon.

Cheers

On Tue, 4 Apr 2023, 14:47 Michael Wechner, <michael.wech...@wyona.com>
wrote:

> IIUC we all agree that the limit could be raised, but we need some solid
> reasoning about what limit makes sense, i.e. why we would set this
> particular limit (e.g. 2048), right?
>
> Thanks
>
> Michael
>
>
> On 04.04.23 at 15:32, Michael McCandless wrote:
>
> > I am not in favor of just doubling it as suggested by some people; I
> > would ideally prefer a value that holds up for a decent while, rather
> > than having to modify it anytime someone requires a higher limit.
>
> The problem with this approach is that it is a one-way door, once released.  We
> would not be able to lower the limit again in the future without possibly
> breaking some applications.
>
> > For example, we don't limit the number of docs per index to an
> > arbitrary maximum of N: you push as many docs as you like, and if they
> > are too many for your system, you get terrible performance/crashes/whatever.
>
> Correction: we do check this limit and throw a specific exception now:
> https://github.com/apache/lucene/issues/6905
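>
> A minimal, hypothetical sketch of that kind of hard check (the method and
> variable names are invented for illustration; IndexWriter.MAX_DOCS is a real
> IndexWriter constant):
>
>     import org.apache.lucene.index.IndexWriter;
>
>     // refuse to grow the index past the hard limit instead of degrading silently
>     static void checkDocLimit(long pendingNumDocs) {
>       if (pendingNumDocs > IndexWriter.MAX_DOCS) {
>         throw new IllegalArgumentException(
>             "number of documents in the index cannot exceed " + IndexWriter.MAX_DOCS);
>       }
>     }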
>
> +1 to raise the limit, but not remove it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 3, 2023 at 9:51 AM Alessandro Benedetti <a.benede...@sease.io>
> wrote:
>
>> ... and what would the next limit be?
>> I guess we'll need to justify it better than the 1024 one.
>> I appreciate the fact that a limit is pretty much wanted by everyone, but
>> I suspect we'll need some solid foundation for deciding the value (and it
>> should be high enough to avoid continuous changes).
>>
>> Cheers
>>
>> On Sun, 2 Apr 2023, 07:29 Michael Wechner, <michael.wech...@wyona.com>
>> wrote:
>>
>>> btw, what was the reasoning to set the current limit to 1024?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> On 01.04.23 at 14:47, Michael Sokolov wrote:
>>>
>>> I'm also in favor of raising this limit. We do see some datasets with
>>> higher than 1024 dims. I also think we need to keep a limit. For example,
>>> we currently need to keep all the vectors in RAM while indexing, and we
>>> want to be able to support reasonable numbers of vectors in an index
>>> segment. Also, we don't know what innovations might come down the road.
>>> Maybe someday we will want to do product quantization and enforce that
>>> (k, m) both fit in a byte -- we wouldn't be able to do that if a vector's
>>> dimension were to exceed 32K.
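>>>
>>> A rough, hypothetical back-of-the-envelope for the RAM point (the numbers
>>> are invented for illustration, not from any real dataset):
>>>
>>>     long numVectors = 10_000_000L;                    // hypothetical vectors in one segment
>>>     int dims = 1536;                                  // e.g. an OpenAI embedding size
>>>     long heapBytes = numVectors * dims * Float.BYTES; // ~61 GB of raw floats held while indexing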
>>>
>>> On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti <
>>> a.benede...@sease.io> wrote:
>>>
>>>> I am also curious what the worst-case scenario would be if we removed
>>>> the constant altogether (so the limit effectively becomes Java's
>>>> Integer.MAX_VALUE).
>>>> i.e.
>>>> right now if you exceed the limit you get:
>>>>
>>>>> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>>>>>   throw new IllegalArgumentException(
>>>>>       "cannot index vectors with dimension greater than "
>>>>>           + ByteVectorValues.MAX_DIMENSIONS);
>>>>> }
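>>>>
>>>> For illustration, a minimal hypothetical snippet that would trip this
>>>> check today (the field/writer names are invented; the exact place the
>>>> exception is thrown may differ):
>>>>
>>>>     float[] embedding = new float[1536]; // e.g. an OpenAI embedding, > MAX_DIMENSIONS (1024)
>>>>     Document doc = new Document();
>>>>     // creating/indexing this field fails up front with the IllegalArgumentException above
>>>>     doc.add(new KnnFloatVectorField("vec", embedding, VectorSimilarityFunction.COSINE));
>>>>     indexWriter.addDocument(doc);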
>>>>
>>>>
>>>> in relation to:
>>>>
>>>>> These limits allow us to
>>>>> better tune our data structures, prevent overflows, help ensure we
>>>>> have good test coverage, etc.
>>>>
>>>>
>>>> I agree 100%, especially about typing things properly and avoiding
>>>> resource waste here and there, but I am not entirely sure this is the
>>>> case for the current implementation, i.e. do we have optimizations in
>>>> place that assume the max dimension is 1024?
>>>> If I missed that (and I likely have), then of course the contribution
>>>> should not just blindly remove the limit, but do it appropriately.
>>>> I am not in favor of just doubling it as suggested by some people; I
>>>> would ideally prefer a value that holds up for a decent while, rather
>>>> than having to modify it anytime someone requires a higher limit.
>>>>
>>>> Cheers
>>>>
>>>> --------------------------
>>>> *Alessandro Benedetti*
>>>> Director @ Sease Ltd.
>>>> *Apache Lucene/Solr Committer*
>>>> *Apache Solr PMC Member*
>>>>
>>>> e-mail: a.benede...@sease.io
>>>>
>>>>
>>>> *Sease* - Information Retrieval Applied
>>>> Consulting | Training | Open Source
>>>>
>>>> Website: Sease.io <http://sease.io/>
>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>> <https://twitter.com/seaseltd> | Youtube
>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>> <https://github.com/seaseltd>
>>>>
>>>>
>>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner <
>>>> michael.wech...@wyona.com> wrote:
>>>>
>>>>> OpenAI reduced their embedding size to 1536 dimensions
>>>>>
>>>>> https://openai.com/blog/new-and-improved-embedding-model
>>>>>
>>>>> so 2048 would work :-)
>>>>>
>>>>> but other services also provide higher dimensions, sometimes with
>>>>> slightly better accuracy
>>>>>
>>>>> Thanks
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> On 31.03.23 at 14:45, Adrien Grand wrote:
>>>>> > I'm supportive of bumping the limit on the maximum dimension for
>>>>> > vectors to something that is above what the majority of users need,
>>>>> > but I'd like to keep a limit. We have limits for other things like the
>>>>> > max number of docs per index, the max term length, the max number of
>>>>> > dimensions of points, etc. and there are a few things that we don't
>>>>> > have limits on that I wish we had limits on. These limits allow us to
>>>>> > better tune our data structures, prevent overflows, help ensure we
>>>>> > have good test coverage, etc.
>>>>> >
>>>>> > That said, these other limits we have in place are quite high. E.g.
>>>>> > the 32kB term limit, nobody would ever type a 32kB term in a text box.
>>>>> > Likewise for the max of 8 dimensions for points: a segment cannot
>>>>> > possibly have 2 splits per dimension on average if it doesn't have
>>>>> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
>>>>> > than 8 would likely defeat the point of indexing. In contrast, our
>>>>> > limit on the number of dimensions of vectors seems to be under what
>>>>> > some users would like, and while I understand the performance argument
>>>>> > against bumping the limit, it doesn't feel to me like something that
>>>>> > would be so bad that we need to prevent users from using numbers of
>>>>> > dimensions in the low thousands, e.g. top-k KNN searches would still
>>>>> > look at a very small subset of the full dataset.
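>>>>> >
>>>>> > (Worked out: 512 * 2^(8*2) = 512 * 65,536 = 33,554,432, i.e. ~34M docs.)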
>>>>> >
>>>>> > So overall, my vote would be to bump the limit to 2048 as suggested by
>>>>> > Mayya on the issue that you linked.
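>>>>> >
>>>>> > Concretely, a hypothetical sketch of what the change itself would look
>>>>> > like (not an actual patch):
>>>>> >
>>>>> >     public static final int MAX_DIMENSIONS = 2048; // was 1024, e.g. in ByteVectorValues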
>>>>> >
>>>>> > On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
>>>>> > <michael.wech...@wyona.com> wrote:
>>>>> >> Thanks Alessandro for summarizing the discussion below!
>>>>> >>
>>>>> >> I understand that there is no clear reasoning about what the best
>>>>> >> embedding size is, but I think heuristic approaches like the one
>>>>> >> described at the following link can be helpful
>>>>> >>
>>>>> >>
>>>>> >> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>>>> >>
>>>>> >> Having said this, we see various embedding services providing higher
>>>>> >> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>>>>> >>
>>>>> >> And it would be great if we could run benchmarks without having to
>>>>> >> recompile Lucene ourselves.
>>>>> >>
>>>>> >> Therefore I would suggest either increasing the limit or, even better,
>>>>> >> removing the limit and adding a disclaimer that people should be aware
>>>>> >> of possible crashes etc.
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> Michael
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On 31.03.23 at 11:43, Alessandro Benedetti wrote:
>>>>> >>
>>>>> >>
>>>>> >> I've been monitoring various discussions on Pull Requests about
>>>>> >> changing the max number of dimensions allowed for Lucene HNSW vectors:
>>>>> >>
>>>>> >> https://github.com/apache/lucene/pull/12191
>>>>> >>
>>>>> >> https://github.com/apache/lucene/issues/11507
>>>>> >>
>>>>> >>
>>>>> >> I would like to set up a discussion and potentially a vote about this.
>>>>> >>
>>>>> >> I have seen some strong opposition from a few people, but a majority
>>>>> >> in favor of this direction.
>>>>> >>
>>>>> >>
>>>>> >> Motivation
>>>>> >>
>>>>> >> We were discussing in the Solr slack channel with Ishan Chattopadhyaya,
>>>>> >> Marcus Eagan, and David Smiley about some neural search integrations in
>>>>> >> Solr: https://github.com/openai/chatgpt-retrieval-plugin
>>>>> >>
>>>>> >>
>>>>> >> Proposal
>>>>> >>
>>>>> >> No hard limit at all.
>>>>> >>
>>>>> >> As for many other Lucene areas, users will be allowed to push the
>>>>> >> system to the limit of their resources and get terrible performance or
>>>>> >> crashes if they want.
>>>>> >>
>>>>> >>
>>>>> >> What we are NOT discussing
>>>>> >>
>>>>> >> - Quality and scalability of the HNSW algorithm
>>>>> >>
>>>>> >> - dimensionality reduction
>>>>> >>
>>>>> >> - strategies to fit in an arbitrary self-imposed limit
>>>>> >>
>>>>> >>
>>>>> >> Benefits
>>>>> >>
>>>>> >> - users can use the models they want to generate vectors
>>>>> >>
>>>>> >> - removal of an arbitrary limit that blocks some integrations
>>>>> >>
>>>>> >>
>>>>> >> Cons
>>>>> >>
>>>>> >>   - if you go for vectors with high dimensions, there's no guarantee
>>>>> >>     you get acceptable performance for your use case
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> I want to keep it simple: right now, in many Lucene areas, you can
>>>>> >> push the system to unacceptable performance or crashes.
>>>>> >>
>>>>> >> For example, we don't limit the number of docs per index to an
>>>>> >> arbitrary maximum of N: you push as many docs as you like, and if they
>>>>> >> are too many for your system, you get terrible
>>>>> >> performance/crashes/whatever.
>>>>> >>
>>>>> >>
>>>>> >> Limits caused by primitive Java types will stay there behind the
>>>>> >> scenes, and that's acceptable, but I would prefer not to have arbitrary
>>>>> >> hard-coded ones that may limit the software's usability and
>>>>> >> integration, which is extremely important for a library.
>>>>> >>
>>>>> >>
>>>>> >> I strongly encourage people to add benefits and cons that I missed
>>>>> >> (I am sure I missed some, but I wanted to keep it simple).
>>>>> >>
>>>>> >>
>>>>> >> Cheers
>>>>> >>
>>>>> >> --------------------------
>>>>> >> Alessandro Benedetti
>>>>> >> Director @ Sease Ltd.
>>>>> >> Apache Lucene/Solr Committer
>>>>> >> Apache Solr PMC Member
>>>>> >>
>>>>> >> e-mail: a.benede...@sease.io
>>>>> >>
>>>>> >>
>>>>> >> Sease - Information Retrieval Applied
>>>>> >> Consulting | Training | Open Source
>>>>> >>
>>>>> >> Website: Sease.io
>>>>> >> LinkedIn | Twitter | Youtube | Github
>>>>> >>
>>>>> >>
>>>>> >
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>>
>>>>>
>>>
>
