I am also curious what the worst-case scenario would be if we removed the constant altogether (so the limit automatically becomes the Java Integer.MAX_VALUE), i.e. right now if you exceed the limit you get:
> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>   throw new IllegalArgumentException(
>       "cannot index vectors with dimension greater than "
>           + ByteVectorValues.MAX_DIMENSIONS);
> }

in relation to:

> These limits allow us to
> better tune our data structures, prevent overflows, help ensure we
> have good test coverage, etc.

I agree 100%, especially about typing things properly and avoiding resource waste here and there, but I am not entirely sure this is the case for the current implementation, i.e. do we have optimizations in place that assume the max dimension to be 1024? If I missed that (and I likely have), then of course the contribution should not just blindly remove the limit, but do it appropriately.

I am not in favor of just doubling it as suggested by some people; I would ideally prefer a solution that remains valid to a decent extent, rather than having to modify it any time someone requires a higher limit.

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*
e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>


On Fri, 31 Mar 2023 at 16:12, Michael Wechner <michael.wech...@wyona.com> wrote:

> OpenAI reduced their size to 1536 dimensions
>
> https://openai.com/blog/new-and-improved-embedding-model
>
> so 2048 would work :-)
>
> but other services do also provide higher dimensions with sometimes
> slightly better accuracy
>
> Thanks
>
> Michael
>
>
> Am 31.03.23 um 14:45 schrieb Adrien Grand:
> > I'm supportive of bumping the limit on the maximum dimension for
> > vectors to something that is above what the majority of users need,
> > but I'd like to keep a limit. We have limits for other things like the
> > max number of docs per index, the max term length, the max number of
> > dimensions of points, etc. and there are a few things that we don't
> > have limits on that I wish we had limits on. These limits allow us to
> > better tune our data structures, prevent overflows, help ensure we
> > have good test coverage, etc.
> >
> > That said, these other limits we have in place are quite high. E.g.
> > the 32kB term limit: nobody would ever type a 32kB term in a text box.
> > Likewise for the max of 8 dimensions for points: a segment cannot
> > possibly have 2 splits per dimension on average if it doesn't have
> > 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions
> > than 8 would likely defeat the point of indexing. In contrast, our
> > limit on the number of dimensions of vectors seems to be under what
> > some users would like, and while I understand the performance argument
> > against bumping the limit, it doesn't feel to me like something that
> > would be so bad that we need to prevent users from using numbers of
> > dimensions in the low thousands, e.g. top-k KNN searches would still
> > look at a very small subset of the full dataset.
> >
> > So overall, my vote would be to bump the limit to 2048 as suggested by
> > Mayya on the issue that you linked.
> >
> > On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
> > <michael.wech...@wyona.com> wrote:
> >> Thanks Alessandro for summarizing the discussion below!
> >>
> >> I understand that there is no clear reasoning re what is the best
> >> embedding size, whereas I think heuristic approaches like described by the
> >> following link can be helpful
> >>
> >> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
> >>
> >> Having said this, we see various embedding services providing higher
> >> dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
> >>
> >> And it would be great if we could run benchmarks without having to
> >> recompile Lucene ourselves.
> >>
> >> Therefore I would suggest to either increase the limit or, even
> >> better, to remove the limit and add a disclaimer that people should be
> >> aware of possible crashes etc.
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
> >>
> >> I've been monitoring various discussions on Pull Requests about
> >> changing the max number of dimensions allowed for Lucene HNSW vectors:
> >>
> >> https://github.com/apache/lucene/pull/12191
> >>
> >> https://github.com/apache/lucene/issues/11507
> >>
> >> I would like to set up a discussion and potentially a vote about this.
> >>
> >> I have seen some strong opposition from a few people but a majority in
> >> favor of this direction.
> >>
> >> Motivation
> >>
> >> We were discussing in the Solr slack channel with Ishan Chattopadhyaya,
> >> Marcus Eagan, and David Smiley about some neural search integrations in
> >> Solr: https://github.com/openai/chatgpt-retrieval-plugin
> >>
> >> Proposal
> >>
> >> No hard limit at all.
> >>
> >> As for many other Lucene areas, users will be allowed to push the
> >> system to the limit of their resources and get terrible performance or
> >> crashes if they want.
> >>
> >> What we are NOT discussing
> >>
> >> - Quality and scalability of the HNSW algorithm
> >>
> >> - dimensionality reduction
> >>
> >> - strategies to fit in an arbitrary self-imposed limit
> >>
> >> Benefits
> >>
> >> - users can use the models they want to generate vectors
> >>
> >> - removal of an arbitrary limit that blocks some integrations
> >>
> >> Cons
> >>
> >> - if you go for vectors with high dimensions, there's no guarantee
> >> you get acceptable performance for your use case
> >>
> >> I want to keep it simple: right now, in many Lucene areas, you can push
> >> the system to unacceptable performance / crashes.
> >>
> >> For example, we don't limit the number of docs per index to an
> >> arbitrary maximum of N; you push as many docs as you like, and if they are
> >> too many for your system, you get terrible performance/crashes/whatever.
> >>
> >> Limits caused by primitive Java types will stay there behind the scenes,
> >> and that's acceptable, but I would prefer not to have arbitrary hard-coded
> >> ones that may limit the software usability and integration, which is
> >> extremely important for a library.
> >>
> >> I strongly encourage people to add benefits and cons that I missed (I
> >> am sure I missed some of them, but wanted to keep it simple).
> >>
> >> Cheers
> >>
> >> --------------------------
> >> Alessandro Benedetti
> >> Director @ Sease Ltd.
> >> Apache Lucene/Solr Committer
> >> Apache Solr PMC Member
> >>
> >> e-mail: a.benede...@sease.io
> >>
> >> Sease - Information Retrieval Applied
> >> Consulting | Training | Open Source
> >>
> >> Website: Sease.io
> >> LinkedIn | Twitter | Youtube | Github
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
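
For context on the check quoted at the top of the thread, here is a minimal sketch of where the limit surfaces to an indexing application. It assumes a Lucene 9.x-era API (KnnFloatVectorField, IndexWriter, ByteBuffersDirectory) and the 1024-dimension default discussed above; the exact class holding MAX_DIMENSIONS and the exception message may differ between releases.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.ByteBuffersDirectory;

    public class VectorDimensionLimitDemo {
      public static void main(String[] args) throws Exception {
        // 1536 dimensions, e.g. an OpenAI embedding as mentioned in the
        // thread: above the 1024 default limit under discussion.
        float[] embedding = new float[1536];

        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
          Document doc = new Document();
          // With the hard-coded limit in place, building the vector field is
          // expected to fail with the IllegalArgumentException quoted above
          // ("cannot index vectors with dimension greater than 1024").
          doc.add(new KnnFloatVectorField("embedding", embedding,
              VectorSimilarityFunction.COSINE));
          writer.addDocument(doc);
        } catch (IllegalArgumentException e) {
          System.err.println("Rejected by the dimension limit: " + e.getMessage());
        }
      }
    }

Bumping the constant to 2048, as proposed by Mayya, would let this particular 1536-dimension example through, while removing the constant entirely would defer any failure to whatever the underlying data structures and available resources can handle.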