Also, technically, it's just the threat of a veto, since we are not actually in a vote thread...
On Sun, Apr 9, 2023 at 12:46 PM Gus Heck <gus.h...@gmail.com> wrote:

> What I see so far:
>
> 1. Much positive support for raising the limit
> 2. Slightly less support for removing it or making it configurable
> 3. A single veto which argues that an (as yet undefined) performance standard must be met before raising the limit
> 4. Hot tempers (various) making this discussion difficult
>
> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on two counts:
>
> 1. No standard for the performance is given, so it cannot be technically met. Without hard criteria it's a moving target.
> 2. It appears to encode a valuation of the user's time, and that valuation is really up to the user. Some users may consider 2 hours useless and not worth it, and others might happily wait 2 hours. This is not a technical decision, it's a business decision regarding the relative value of the time invested vs. the value of the result. If I can cure cancer by indexing for a year, that might be worth it... (hyperbole, of course).
>
> Things I would consider to have technical merit that I don't hear:
>
> 1. Impact on the speed of **other** indexing operations (devaluation of other functionality).
> 2. Actual scenarios that work when the limit is low and fail when the limit is high (new failures on the same data with the limit raised).
>
> One thing that might or might not have technical merit:
>
> 1. If someone feels there is a lack of documentation of the costs/performance implications of using large vectors, possibly including reproducible benchmarks establishing the scaling behavior (there seems to be disagreement on O(n) vs. O(n^2)).
>
> The users *should* know what they are getting into, but if the cost is worth it to them, they should be able to pay it without forking the project. If this veto causes a fork, that's not good.
>
> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100- and 300-dimensional varieties and can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that, and I believe the performance numbers would be plausible.
>>
>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>>
>>> Can we set up a branch in which the limit is bumped to 2048, then have a realistic, free data set (Wikipedia sample or something) that has, say, 5 million docs and vectors created using public data (GloVe pre-trained embeddings or the like)? We could then run indexing on the same hardware with 512, 1024 and 2048 dimensions and see what the numbers, limits and behavior actually are.
>>>
>>> I can help in writing this but not until after Easter.
>>>
>>> Dawid
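
For concreteness, the experiment Dawid sketches above might look roughly like the sketch below. This is only an illustration under assumptions: it presumes a build where the dimension limit has been raised, it uses Lucene 9.x's KnnFloatVectorField API, and it fills documents with random vectors as a stand-in for real GloVe/Wikipedia embeddings. The class name, field names, and parameter values are placeholders, not anything taken from the thread.

import java.nio.file.Paths;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

/** Hypothetical benchmark driver: index numDocs vectors at a given dimension and time it. */
public class VectorIndexBench {
  public static void main(String[] args) throws Exception {
    int dim = Integer.parseInt(args[0]);      // e.g. 512, 1024, 2048 (2048 assumes the limit was raised)
    int numDocs = Integer.parseInt(args[1]);  // e.g. 5000000
    Random random = new Random(42);

    try (FSDirectory dir = FSDirectory.open(Paths.get("vector-bench-" + dim));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      long start = System.nanoTime();
      for (int i = 0; i < numDocs; i++) {
        // Random data stands in for real pre-trained embeddings here.
        float[] vector = new float[dim];
        for (int j = 0; j < dim; j++) {
          vector[j] = random.nextFloat();
        }
        Document doc = new Document();
        doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
        doc.add(new KnnFloatVectorField("vec", vector, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
      }
      // Force a full merge so the merge-time cost being debated shows up in the measurement too.
      writer.forceMerge(1);
      long seconds = (System.nanoTime() - start) / 1_000_000_000L;
      System.out.println("dim=" + dim + " docs=" + numDocs + " took " + seconds + "s");
    }
  }
}

Running the same document count at 512, 1024 and 2048 dimensions on the same hardware would give the comparative wall-clock and memory numbers being asked for above.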
>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> wrote:
>>> >
>>> > As Dawid pointed out earlier on this thread, this is the rule for Apache projects: a single -1 vote on a code change is a veto and cannot be overridden. Furthermore, Robert is one of the people on this project who has worked the most on debugging subtle bugs, making Lucene more robust and improving our test framework, so I'm listening when he voices quality concerns.
>>> >
>>> > The argument against removing/raising the limit that resonates with me the most is that it is a one-way door. As MikeS highlighted earlier on this thread, implementations may want to take advantage of the fact that there is a limit at some point too. This is why I don't want to remove the limit and would prefer a slight increase, such as 2048 as suggested in the original issue, which would enable most of the things that users who have been asking about raising the limit would like to do.
>>> >
>>> > I agree that the merge-time memory usage and slow indexing rate are not great. But it's still possible to index multi-million-vector datasets with a 4GB heap without hitting OOMEs regardless of the number of dimensions, and the feedback I'm seeing is that many users are still interested in indexing multi-million-vector datasets despite the slow indexing rate. I wish we could do better, and vector indexing is certainly more expert than text indexing, but it is still usable in my opinion. I understand how giving Lucene more information about vectors prior to indexing (e.g. clustering information, as Jim pointed out) could help make merging faster and more memory-efficient, but I would really like to avoid making it a requirement for indexing vectors, as that also makes this feature much harder to use.
>>> >
>>> > On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>> > >
>>> > > I am very attentive to listening to opinions, but I am unconvinced here, and I am not sure that a single person's opinion should be allowed to be detrimental to such an important project.
>>> > >
>>> > > The limit, as far as I know, is literally just raising an exception. Removing it won't alter in any way the current performance for users in low-dimensional space. Removing it will just enable more users to use Lucene.
>>> > >
>>> > > If new users in certain situations are unhappy with the performance, they may contribute improvements. This is how you make progress.
>>> > >
>>> > > If it's a reputation thing, trust me that not allowing users to play with high-dimensional space will damage it equally.
>>> > >
>>> > > To me it's really a no-brainer. Removing the limit and enabling people to use high-dimensional vectors will take minutes. Improving the HNSW implementation can take months. Pick one to begin with...
>>> > >
>>> > > And there's no one paying me here, no company interest whatsoever; actually I pay people to contribute. I am just convinced it's a good idea.
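
Alessandro's point above that the limit is "literally just raising an exception" can be pictured as a simple upper-bound argument check, along the lines of the hypothetical sketch below. This is illustrative only and is not Lucene's actual source; the class name, constant name, and message are made up.

// Illustrative sketch only; not Lucene's actual code.
public class DimensionLimitSketch {
  // The value being debated: today's cap vs. 2048 vs. no cap at all.
  static final int MAX_DIMENSIONS = 1024;

  static void checkVectorDimension(int dimension) {
    if (dimension <= 0 || dimension > MAX_DIMENSIONS) {
      throw new IllegalArgumentException(
          "vector dimension must be between 1 and " + MAX_DIMENSIONS + ", got: " + dimension);
    }
    // Vectors at or below the cap pass through untouched, which is the basis for the claim
    // that raising (or removing) the cap does not change behavior for existing users.
  }
}

Under that reading, raising the constant only changes which inputs are rejected; vectors already under the old bound take exactly the same code path as before.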
>>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>>> > >>
>>> > >> I disagree with your categorization. I put in plenty of work and experienced plenty of pain myself, writing tests and fighting these issues, after I saw that, two releases in a row, vector indexing fell over and hit integer overflows etc. on small datasets:
>>> > >>
>>> > >> https://github.com/apache/lucene/pull/11905
>>> > >>
>>> > >> Attacking me isn't helping the situation.
>>> > >>
>>> > >> PS: when I said "the one guy who wrote the code" I didn't mean it in any kind of demeaning fashion, really. I meant to describe the current state of usability with respect to indexing a few million docs with high dimensions. You can scroll up the thread and see that at least one other committer on the project experienced similar pain as me. Then, think about users who aren't committers trying to use the functionality!
>>> > >>
>>> > >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
>>> > >> >
>>> > >> > What you said about increasing dimensions requiring a bigger RAM buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>> > >> >
>>> > >> > You complain that HNSW sucks and doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>> > >> >
>>> > >> > You demand that people run all kinds of tests to prove you wrong, but when they do, you don't listen, and you won't put in the work yourself, or you complain that it's too hard.
>>> > >> >
>>> > >> > Then you complain about people not meeting you halfway. Wow.
>>> > >> >
>>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>>> > >> >>
>>> > >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>> > >> >> >
>>> > >> >> > What exactly do you consider reasonable?
>>> > >> >>
>>> > >> >> Let's begin a real discussion by being HONEST about the current status. Please put political correctness or your own company's wishes aside; we know it's not in a good state.
>>> > >> >>
>>> > >> >> The current status is that the one guy who wrote the code can set a multi-gigabyte RAM buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware).
>>> > >> >>
>>> > >> >> My concern is everyone else except the one guy; I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte RAM buffer and a bigger heap to avoid OOM on merge. It is also a permanent backwards-compatibility decision: we have to support it once we do this, and we can't just say "oops" and flip it back.
>>> > >> >>
>>> > >> >> It is unclear to me whether the multi-gigabyte RAM buffer is really to avoid merges because they are so slow and it would be DAYS otherwise, or whether it's to avoid merges so it doesn't hit OOM. Also, from personal experience, it takes trial and error (meaning experiencing OOM on merge!!!) before you get those heap values correct for your dataset. This usually means starting over, which is frustrating and wastes more time.
>>> > >> >>
>>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter; that seems to me like a good idea. Maybe the multi-gigabyte RAM buffer can be avoided in this way and performance improved by writing bigger segments with Lucene's defaults. But this doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales.
>>> > >> >>
>>> > >> >> At least it shouldn't spike RAM on trivial data amounts and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
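
As a rough picture of the trial-and-error tuning Robert describes, the knobs involved are the JVM heap (set on the command line) and IndexWriter's RAM buffer. The sketch below only shows where those settings live; the class name and the concrete numbers are placeholders, not recommendations, since per the thread they currently have to be found empirically per dataset.

import java.nio.file.Paths;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class WriterTuningSketch {
  public static void main(String[] args) throws Exception {
    // The JVM heap is set outside the code, e.g. "java -Xmx8g ..."; per the point above,
    // too small a heap shows up as an OOM during merge, which means starting the index over.
    IndexWriterConfig config = new IndexWriterConfig();
    // A multi-gigabyte buffer flushes fewer, larger segments, trading indexing-time RAM
    // against the number of expensive graph-rebuilding merges discussed above.
    config.setRAMBufferSizeMB(2000); // placeholder value; today this is found by trial and error
    try (FSDirectory dir = FSDirectory.open(Paths.get("vector-index"));
        IndexWriter writer = new IndexWriter(dir, config)) {
      // ... add documents with KnnFloatVectorField values here ...
      writer.commit();
    }
  }
}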
--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)