Can the limit be raised using Java reflection at run time? Or is there more to it that needs to be changed?
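(Presumably the attempt would look like the sketch below; the class and field names are my assumption about where the limit lives in 9.x, e.g. VectorValues.MAX_DIMENSIONS. If it is a plain static final int, I would expect this to fail twice over: the JDK refuses reflective writes to static final fields, and a compile-time constant is inlined into its call sites anyway, per JLS 15.28.)

    import java.lang.reflect.Field;

    public class RaiseVectorLimit {
        public static void main(String[] args) throws Exception {
            // Assumption: the limit is a public static final int on this class
            // (Lucene 9.x declares VectorValues.MAX_DIMENSIONS = 1024).
            Class<?> holder = Class.forName("org.apache.lucene.index.VectorValues");
            Field limit = holder.getDeclaredField("MAX_DIMENSIONS");
            // May throw InaccessibleObjectException if Lucene is on the module path.
            limit.setAccessible(true);
            // Throws IllegalAccessException: reflection cannot write static final fields.
            limit.setInt(null, 4096);
            // And even if both calls succeeded, the old value of 1024 would already
            // be folded into every call site that enforces the check (JLS 15.28).
        }
    }

So it looks like the check itself would have to change in the source, not just the constant at run time.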
On Sun, 9 Apr, 2023, 12:58 am Alessandro Benedetti, <a.benede...@sease.io> wrote:

> I am very attentive to listening to opinions, but I am unconvinced here,
> and I am not sure that a single person's opinion should be allowed to be
> detrimental to such an important project.
>
> The limit, as far as I know, is literally just raising an exception.
> Removing it won't alter the current performance in any way for users in
> low-dimensional space.
> Removing it will just enable more users to use Lucene.
>
> If new users in certain situations are unhappy with the performance,
> they may contribute improvements.
> This is how you make progress.
>
> If it's a reputation thing, trust me that not allowing users to play with
> high-dimensional space will damage it just as much.
>
> To me it's really a no-brainer.
> Removing the limit and enabling people to use high-dimensional vectors
> will take minutes.
> Improving the HNSW implementation can take months.
> Pick one to begin with...
>
> And there's no one paying me here, no company interest whatsoever;
> actually, I pay people to contribute. I am just convinced it's a good idea.
>
> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>
>> I disagree with your categorization. I put in plenty of work and
>> experienced plenty of pain myself, writing tests and fighting these
>> issues, after I saw that, two releases in a row, vector indexing fell
>> over and hit integer overflows etc. on small datasets:
>>
>> https://github.com/apache/lucene/pull/11905
>>
>> Attacking me isn't helping the situation.
>>
>> PS: when I said "the one guy who wrote the code" I didn't mean it in
>> any kind of demeaning fashion, really. I meant to describe the current
>> state of usability with respect to indexing a few million docs with
>> high dimensions. You can scroll up the thread and see that at least
>> one other committer on the project experienced pain similar to mine.
>> Then, think about users who aren't committers trying to use the
>> functionality!
>>
>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com>
>> wrote:
>> >
>> > What you said about increasing dimensions requiring a bigger RAM
>> > buffer on merge is wrong. That's the point I was trying to make. Your
>> > concerns about merge costs are not wrong, but your conclusion that we
>> > need to limit dimensions is not justified.
>> >
>> > You complain that HNSW sucks and doesn't scale, but when I show that
>> > it scales linearly with dimension you just ignore that and complain
>> > about something entirely different.
>> >
>> > You demand that people run all kinds of tests to prove you wrong, but
>> > when they do, you don't listen, and you won't put in the work yourself,
>> > or you complain that it's too hard.
>> >
>> > Then you complain about people not meeting you halfway. Wow.
>> >
>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>> >>
>> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>> >> <michael.wech...@wyona.com> wrote:
>> >> >
>> >> > What exactly do you consider reasonable?
>> >>
>> >> Let's begin a real discussion by being HONEST about the current
>> >> status. Please put political correctness and your own company's
>> >> wishes aside; we know it's not in a good state.
>> >>
>> >> The current status is that the one guy who wrote the code can set a
>> >> multi-gigabyte RAM buffer and index a small dataset with 1024
>> >> dimensions in HOURS (I didn't ask what hardware).
>> >>
>> >> My concern is everyone else except the one guy; I want it to be
>> >> usable.
>> >> Increasing dimensions just means an even bigger RAM buffer and a
>> >> bigger heap to avoid OOM on merge.
>> >> It is also a permanent backwards-compatibility decision: we have to
>> >> support it once we do this, and we can't just say "oops" and flip it
>> >> back.
>> >>
>> >> It is unclear to me whether the multi-gigabyte RAM buffer is really
>> >> there to avoid merges because they are so slow that it would take
>> >> DAYS otherwise, or to avoid merges so that it doesn't hit OOM.
>> >> Also, from personal experience, it takes trial and error (meaning
>> >> experiencing OOM on merge!!!) before you get those heap values right
>> >> for your dataset. This usually means starting over, which is
>> >> frustrating and wastes more time.
>> >>
>> >> Jim mentioned some ideas about the memory usage in IndexWriter; that
>> >> seems like a good idea to me. Maybe the multi-gigabyte RAM buffer can
>> >> be avoided this way, and performance improved, by writing bigger
>> >> segments with Lucene's defaults. But this doesn't mean we can simply
>> >> ignore the horrors of what happens on merge. Merging needs to scale
>> >> so that indexing really scales.
>> >>
>> >> At the very least it shouldn't spike RAM on trivial amounts of data
>> >> and cause OOM, and it definitely shouldn't burn hours and hours of
>> >> CPU in O(n^2) fashion when indexing.
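For concreteness on "the limit is literally just raising an exception" above: the enforcement in question is an upfront argument check, roughly of the shape sketched below (an illustration of the idea, not Lucene's actual source). Removing or raising the limit means touching this one guard; the performance questions live elsewhere, in the HNSW writer and in merging.

    // Sketch of the kind of guard that enforces a dimension limit at
    // field-creation time. Removing the limit amounts to deleting or
    // relaxing this check; no indexing or search code path changes.
    static void checkDimension(int dimension, int maxDimensions) {
        if (dimension <= 0) {
            throw new IllegalArgumentException(
                "vector dimension must be positive, got " + dimension);
        }
        if (dimension > maxDimensions) {
            throw new IllegalArgumentException(
                "cannot index vectors with dimension > " + maxDimensions
                    + ", got " + dimension);
        }
    }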
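And the "multi-gigabyte RAM buffer" in Robert's message is IndexWriterConfig's flush threshold. A minimal sketch of the workaround being described, assuming Lucene 9.x; setRAMBufferSizeMB is the real knob, while the 4 GB figure, the index path, and the 1024-dimension vector are illustrative only, not recommendations:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class BigBufferVectorIndexing {
        public static void main(String[] args) throws Exception {
            // Flush very large segments so fewer merges happen. The JVM heap
            // (-Xmx) must be sized well above this buffer, or the flush/merge
            // is exactly where the OOM described above lands.
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
                .setRAMBufferSizeMB(4096);
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/vector-index")), config)) {
                float[] vector = new float[1024]; // at the current dimension limit
                Document doc = new Document();
                doc.add(new KnnVectorField("vec", vector,
                    VectorSimilarityFunction.EUCLIDEAN));
                writer.addDocument(doc);
            }
        }
    }

Finding a buffer and heap combination that survives merging is the trial-and-error process Robert describes; nothing in this sketch removes that cost, it only moves it around.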