Also, technically, it's just the threat of a veto, since we are not actually in a vote thread...
On Sun, Apr 9, 2023 at 12:46 PM Gus Heck <gus.h...@gmail.com> wrote:

> What I see so far:
>
> 1. Much positive support for raising the limit
> 2. Slightly less support for removing it or making it configurable
> 3. A single veto which argues that an (as yet undefined) performance standard must be met before raising the limit
> 4. Hot tempers (various) making this discussion difficult
>
> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on two counts:
>
> 1. No standard for the performance is given, so it cannot be technically met. Without hard criteria it's a moving target.
> 2. It appears to encode a valuation of the user's time, and that valuation is really up to the user. Some users may consider 2 hours useless and not worth it, and others might happily wait 2 hours. This is not a technical decision, it's a business decision regarding the relative value of the time invested vs. the value of the result. If I can cure cancer by indexing for a year, that might be worth it... (hyperbole, of course).
>
> Things I would consider to have technical merit that I don't hear:
>
> 1. Impact on the speed of **other** indexing operations (devaluation of other functionality).
> 2. Actual scenarios that work when the limit is low and fail when the limit is high (new failures on the same data with the limit raised).
>
> One thing that might or might not have technical merit:
>
> 1. If someone feels there is a lack of documentation of the costs/performance implications of using large vectors, possibly including reproducible benchmarks establishing the scaling behavior (there seems to be disagreement on O(n) vs. O(n^2)).
>
> The users *should* know what they are getting into, but if the cost is worth it to them, they should be able to pay it without forking the project. If this veto causes a fork, that's not good.
>
> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100- and 300-dimensional varieties and can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that, and I believe the performance numbers would be plausible.
>>
>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>>
>>> Can we set up a branch in which the limit is bumped to 2048, then have a realistic, free data set (Wikipedia sample or something) that has, say, 5 million docs and vectors created using public data (GloVe pre-trained embeddings or the like)? We could then run indexing on the same hardware with 512, 1024 and 2048 dimensions and see what the numbers, limits and behavior actually are.
>>>
>>> I can help in writing this but not until after Easter.
>>>
>>> Dawid
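
For concreteness, the experiment Dawid sketches above might look roughly like the sketch below. This is only an illustration under assumptions: it presumes a build where the dimension limit has been raised, it uses Lucene 9.x's KnnFloatVectorField API, and it fills documents with random vectors as a stand-in for real GloVe/Wikipedia embeddings. The class name, field names, and parameter values are placeholders, not anything taken from the thread.

import java.nio.file.Paths;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

/** Hypothetical benchmark driver: index numDocs vectors at a given dimension and time it. */
public class VectorIndexBench {
  public static void main(String[] args) throws Exception {
    int dim = Integer.parseInt(args[0]);      // e.g. 512, 1024, 2048 (2048 assumes the limit was raised)
    int numDocs = Integer.parseInt(args[1]);  // e.g. 5000000
    Random random = new Random(42);

    try (FSDirectory dir = FSDirectory.open(Paths.get("vector-bench-" + dim));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      long start = System.nanoTime();
      for (int i = 0; i < numDocs; i++) {
        // Random data stands in for real pre-trained embeddings here.
        float[] vector = new float[dim];
        for (int j = 0; j < dim; j++) {
          vector[j] = random.nextFloat();
        }
        Document doc = new Document();
        doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
        doc.add(new KnnFloatVectorField("vec", vector, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
      }
      // Force a full merge so the merge-time cost being debated shows up in the measurement too.
      writer.forceMerge(1);
      long seconds = (System.nanoTime() - start) / 1_000_000_000L;
      System.out.println("dim=" + dim + " docs=" + numDocs + " took " + seconds + "s");
    }
  }
}

Running the same document count at 512, 1024 and 2048 dimensions on the same hardware would give the comparative wall-clock and memory numbers being asked for above.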
>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> wrote:
>>> >
>>> > As Dawid pointed out earlier on this thread, this is the rule for Apache projects: a single -1 vote on a code change is a veto and cannot be overridden. Furthermore, Robert is one of the people on this project who has worked the most on debugging subtle bugs, making Lucene more robust and improving our test framework, so I'm listening when he voices quality concerns.
>>> >
>>> > The argument against removing/raising the limit that resonates with me the most is that it is a one-way door. As MikeS highlighted earlier on this thread, implementations may want to take advantage of the fact that there is a limit at some point too. This is why I don't want to remove the limit and would prefer a slight increase, such as 2048 as suggested in the original issue, which would enable most of the things that users who have been asking about raising the limit would like to do.
>>> >
>>> > I agree that the merge-time memory usage and slow indexing rate are not great. But it's still possible to index multi-million-vector datasets with a 4GB heap without hitting OOMEs regardless of the number of dimensions, and the feedback I'm seeing is that many users are still interested in indexing multi-million-vector datasets despite the slow indexing rate. I wish we could do better, and vector indexing is certainly more expert than text indexing, but it is still usable in my opinion. I understand how giving Lucene more information about vectors prior to indexing (e.g. clustering information, as Jim pointed out) could help make merging faster and more memory-efficient, but I would really like to avoid making it a requirement for indexing vectors, as that also makes this feature much harder to use.
>>> >
>>> > On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>> > >
>>> > > I am very attentive to listening to opinions, but I am unconvinced here, and I am not sure that a single person's opinion should be allowed to be detrimental to such an important project.
>>> > >
>>> > > The limit, as far as I know, is literally just raising an exception. Removing it won't alter in any way the current performance for users in low-dimensional space. Removing it will just enable more users to use Lucene.
>>> > >
>>> > > If new users in certain situations are unhappy with the performance, they may contribute improvements. This is how you make progress.
>>> > >
>>> > > If it's a reputation thing, trust me that not allowing users to play with high-dimensional space will damage it equally.
>>> > >
>>> > > To me it's really a no-brainer. Removing the limit and enabling people to use high-dimensional vectors will take minutes. Improving the HNSW implementation can take months. Pick one to begin with...
>>> > >
>>> > > And there's no one paying me here, no company interest whatsoever; actually I pay people to contribute. I am just convinced it's a good idea.
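
Alessandro's point above that the limit is "literally just raising an exception" can be pictured as a simple upper-bound argument check, along the lines of the hypothetical sketch below. This is illustrative only and is not Lucene's actual source; the class name, constant name, and message are made up.

// Illustrative sketch only; not Lucene's actual code.
public class DimensionLimitSketch {
  // The value being debated: today's cap vs. 2048 vs. no cap at all.
  static final int MAX_DIMENSIONS = 1024;

  static void checkVectorDimension(int dimension) {
    if (dimension <= 0 || dimension > MAX_DIMENSIONS) {
      throw new IllegalArgumentException(
          "vector dimension must be between 1 and " + MAX_DIMENSIONS + ", got: " + dimension);
    }
    // Vectors at or below the cap pass through untouched, which is the basis for the claim
    // that raising (or removing) the cap does not change behavior for existing users.
  }
}

Under that reading, raising the constant only changes which inputs are rejected; vectors already under the old bound take exactly the same code path as before.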
>>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>>> > >>
>>> > >> I disagree with your categorization. I put in plenty of work and experienced plenty of pain myself, writing tests and fighting these issues, after I saw that, two releases in a row, vector indexing fell over and hit integer overflows etc. on small datasets:
>>> > >>
>>> > >> https://github.com/apache/lucene/pull/11905
>>> > >>
>>> > >> Attacking me isn't helping the situation.
>>> > >>
>>> > >> PS: when I said "the one guy who wrote the code" I didn't mean it in any kind of demeaning fashion, really. I meant to describe the current state of usability with respect to indexing a few million docs with high dimensions. You can scroll up the thread and see that at least one other committer on the project experienced similar pain as me. Then, think about users who aren't committers trying to use the functionality!
>>> > >>
>>> > >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
>>> > >> >
>>> > >> > What you said about increasing dimensions requiring a bigger RAM buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>> > >> >
>>> > >> > You complain that HNSW sucks and doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>> > >> >
>>> > >> > You demand that people run all kinds of tests to prove you wrong, but when they do, you don't listen, and you won't put in the work yourself, or you complain that it's too hard.
>>> > >> >
>>> > >> > Then you complain about people not meeting you halfway. Wow.
>>> > >> >
>>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>>> > >> >>
>>> > >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>> > >> >> >
>>> > >> >> > What exactly do you consider reasonable?
>>> > >> >>
>>> > >> >> Let's begin a real discussion by being HONEST about the current status. Please put political correctness or your own company's wishes aside; we know it's not in a good state.
>>> > >> >>
>>> > >> >> The current status is that the one guy who wrote the code can set a multi-gigabyte RAM buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware).
>>> > >> >>
>>> > >> >> My concern is everyone else except the one guy; I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte RAM buffer and a bigger heap to avoid OOM on merge. It is also a permanent backwards-compatibility decision: we have to support it once we do this, and we can't just say "oops" and flip it back.
>>> > >> >>
>>> > >> >> It is unclear to me whether the multi-gigabyte RAM buffer is really to avoid merges because they are so slow and it would be DAYS otherwise, or whether it's to avoid merges so it doesn't hit OOM. Also, from personal experience, it takes trial and error (meaning experiencing OOM on merge!!!) before you get those heap values correct for your dataset. This usually means starting over, which is frustrating and wastes more time.
>>> > >> >>
>>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter; that seems to me like a good idea. Maybe the multi-gigabyte RAM buffer can be avoided in this way and performance improved by writing bigger segments with Lucene's defaults. But this doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales.
>>> > >> >>
>>> > >> >> At least it shouldn't spike RAM on trivial data amounts and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
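
As a rough picture of the trial-and-error tuning Robert describes, the knobs involved are the JVM heap (set on the command line) and IndexWriter's RAM buffer. The sketch below only shows where those settings live; the class name and the concrete numbers are placeholders, not recommendations, since per the thread they currently have to be found empirically per dataset.

import java.nio.file.Paths;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class WriterTuningSketch {
  public static void main(String[] args) throws Exception {
    // The JVM heap is set outside the code, e.g. "java -Xmx8g ..."; per the point above,
    // too small a heap shows up as an OOM during merge, which means starting the index over.
    IndexWriterConfig config = new IndexWriterConfig();
    // A multi-gigabyte buffer flushes fewer, larger segments, trading indexing-time RAM
    // against the number of expensive graph-rebuilding merges discussed above.
    config.setRAMBufferSizeMB(2000); // placeholder value; today this is found by trial and error
    try (FSDirectory dir = FSDirectory.open(Paths.get("vector-index"));
        IndexWriter writer = new IndexWriter(dir, config)) {
      // ... add documents with KnnFloatVectorField values here ...
      writer.commit();
    }
  }
}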
--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)