Re: [Proposal] Remove max number of dimensions for KNN vectors

Dawid Weiss Sat, 08 Apr 2023 22:32:04 -0700

Can we set up a branch in which the limit is bumped to 2048, then have
a realistic, free data set (wikipedia sample or something) that has,
say, 5 million docs and vectors created using public data (glove
pre-trained embeddings or the like)? We then could run indexing on the
same hardware with 512, 1024 and 2048 and see what the numbers, limits
and behavior actually are.


I can help in writing this but not until after Easter.


Dawid

On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <[email protected]> wrote:
>
> As Dawid pointed out earlier on this thread, this is the rule for
> Apache projects: a single -1 vote on a code change is a veto and
> cannot be overridden. Furthermore, Robert is one of the people on this
> project who worked the most on debugging subtle bugs, making Lucene
> more robust and improving our test framework, so I'm listening when he
> voices quality concerns.
>
> The argument against removing/raising the limit that resonates with me
> the most is that it is a one-way door. As MikeS highlighted earlier on
> this thread, implementations may want to take advantage of the fact
> that there is a limit at some point too. This is why I don't want to
> remove the limit and would prefer a slight increase, such as 2048 as
> suggested in the original issue, which would enable most of the things
> that users who have been asking about raising the limit would like to
> do.
>
> I agree that the merge-time memory usage and slow indexing rate are
> not great. But it's still possible to index multi-million vector
> datasets with a 4GB heap without hitting OOMEs regardless of the
> number of dimensions, and the feedback I'm seeing is that many users
> are still interested in indexing multi-million vector datasets despite
> the slow indexing rate. I wish we could do better, and vector indexing
> is certainly more expert than text indexing, but it still is usable in
> my opinion. I understand how giving Lucene more information about
> vectors prior to indexing (e.g. clustering information as Jim pointed
> out) could help make merging faster and more memory-efficient, but I
> would really like to avoid making it a requirement for indexing
> vectors as it also makes this feature much harder to use.
>
> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
> <[email protected]> wrote:
> >
> > I am very attentive to listen opinions but I am un-convinced here and I an 
> > not sure that a single person opinion should be allowed to be detrimental 
> > for such an important project.
> >
> > The limit as far as I know is literally just raising an exception.
> > Removing it won't alter in any way the current performance for users in low 
> > dimensional space.
> > Removing it will just enable more users to use Lucene.
> >
> > If new users in certain situations will be unhappy with the performance, 
> > they may contribute improvements.
> > This is how you make progress.
> >
> > If it's a reputation thing, trust me that not allowing users to play with 
> > high dimensional space will equally damage it.
> >
> > To me it's really a no brainer.
> > Removing the limit and enable people to use high dimensional vectors will 
> > take minutes.
> > Improving the hnsw implementation can take months.
> > Pick one to begin with...
> >
> > And there's no-one paying me here, no company interest whatsoever, actually 
> > I pay people to contribute, I am just convinced it's a good idea.
> >
> >
> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <[email protected]> wrote:
> >>
> >> I disagree with your categorization. I put in plenty of work and
> >> experienced plenty of pain myself, writing tests and fighting these
> >> issues, after i saw that, two releases in a row, vector indexing fell
> >> over and hit integer overflows etc on small datasets:
> >>
> >> https://github.com/apache/lucene/pull/11905
> >>
> >> Attacking me isn't helping the situation.
> >>
> >> PS: when i said the "one guy who wrote the code" I didn't mean it in
> >> any kind of demeaning fashion really. I meant to describe the current
> >> state of usability with respect to indexing a few million docs with
> >> high dimensions. You can scroll up the thread and see that at least
> >> one other committer on the project experienced similar pain as me.
> >> Then, think about users who aren't committers trying to use the
> >> functionality!
> >>
> >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <[email protected]> wrote:
> >> >
> >> > What you said about increasing dimensions requiring a bigger ram buffer 
> >> > on merge is wrong. That's the point I was trying to make. Your concerns 
> >> > about merge costs are not wrong, but your conclusion that we need to 
> >> > limit dimensions is not justified.
> >> >
> >> > You complain that hnsw sucks it doesn't scale, but when I show it scales 
> >> > linearly with dimension you just ignore that and complain about 
> >> > something entirely different.
> >> >
> >> > You demand that people run all kinds of tests to prove you wrong but 
> >> > when they do, you don't listen and you won't put in the work yourself or 
> >> > complain that it's too hard.
> >> >
> >> > Then you complain about people not meeting you half way. Wow
> >> >
> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <[email protected]> wrote:
> >> >>
> >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
> >> >> <[email protected]> wrote:
> >> >> >
> >> >> > What exactly do you consider reasonable?
> >> >>
> >> >> Let's begin a real discussion by being HONEST about the current
> >> >> status. Please put politically correct or your own company's wishes
> >> >> aside, we know it's not in a good state.
> >> >>
> >> >> Current status is the one guy who wrote the code can set a
> >> >> multi-gigabyte ram buffer and index a small dataset with 1024
> >> >> dimensions in HOURS (i didn't ask what hardware).
> >> >>
> >> >> My concerns are everyone else except the one guy, I want it to be
> >> >> usable. Increasing dimensions just means even bigger multi-gigabyte
> >> >> ram buffer and bigger heap to avoid OOM on merge.
> >> >> It is also a permanent backwards compatibility decision, we have to
> >> >> support it once we do this and we can't just say "oops" and flip it
> >> >> back.
> >> >>
> >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
> >> >> avoid merges because they are so slow and it would be DAYS otherwise,
> >> >> or if its to avoid merges so it doesn't hit OOM.
> >> >> Also from personal experience, it takes trial and error (means
> >> >> experiencing OOM on merge!!!) before you get those heap values correct
> >> >> for your dataset. This usually means starting over which is
> >> >> frustrating and wastes more time.
> >> >>
> >> >> Jim mentioned some ideas about the memory usage in IndexWriter, seems
> >> >> to me like its a good idea. maybe the multigigabyte ram buffer can be
> >> >> avoided in this way and performance improved by writing bigger
> >> >> segments with lucene's defaults. But this doesn't mean we can simply
> >> >> ignore the horrors of what happens on merge. merging needs to scale so
> >> >> that indexing really scales.
> >> >>
> >> >> At least it shouldnt spike RAM on trivial data amounts and cause OOM,
> >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
> >> >> fashion when indexing.
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: [email protected]
> >> >> For additional commands, e-mail: [email protected]
> >> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [Proposal] Remove max number of dimensions for KNN vectors

Reply via email to