Can the limit be raised using Java reflection at run time? Or is there more to it that needs to be changed?
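(Presumably the attempt would look like the sketch below; the class and field names are my assumption about where the limit lives in 9.x, e.g. VectorValues.MAX_DIMENSIONS. If it is a plain static final int, I would expect this to fail twice over: the JDK refuses reflective writes to static final fields, and a compile-time constant is inlined into its call sites anyway, per JLS 15.28.)

    import java.lang.reflect.Field;

    public class RaiseVectorLimit {
        public static void main(String[] args) throws Exception {
            // Assumption: the limit is a public static final int on this class
            // (Lucene 9.x declares VectorValues.MAX_DIMENSIONS = 1024).
            Class<?> holder = Class.forName("org.apache.lucene.index.VectorValues");
            Field limit = holder.getDeclaredField("MAX_DIMENSIONS");
            // May throw InaccessibleObjectException if Lucene is on the module path.
            limit.setAccessible(true);
            // Throws IllegalAccessException: reflection cannot write static final fields.
            limit.setInt(null, 4096);
            // And even if both calls succeeded, the old value of 1024 would already
            // be folded into every call site that enforces the check (JLS 15.28).
        }
    }

So it looks like the check itself would have to change in the source, not just the constant at run time.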
On Sun, 9 Apr, 2023, 12:58 am Alessandro Benedetti, <a.benede...@sease.io> wrote:

> I am very attentive to listening to opinions, but I am unconvinced here,
> and I am not sure that a single person's opinion should be allowed to be
> detrimental to such an important project.
>
> The limit, as far as I know, is literally just raising an exception.
> Removing it won't alter the current performance in any way for users in
> low-dimensional space.
> Removing it will just enable more users to use Lucene.
>
> If new users in certain situations are unhappy with the performance,
> they may contribute improvements.
> This is how you make progress.
>
> If it's a reputation thing, trust me that not allowing users to play with
> high-dimensional space will damage it just as much.
>
> To me it's really a no-brainer.
> Removing the limit and enabling people to use high-dimensional vectors
> will take minutes.
> Improving the HNSW implementation can take months.
> Pick one to begin with...
>
> And there's no one paying me here, no company interest whatsoever;
> actually, I pay people to contribute. I am just convinced it's a good idea.
>
> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>
>> I disagree with your categorization. I put in plenty of work and
>> experienced plenty of pain myself, writing tests and fighting these
>> issues, after I saw that, two releases in a row, vector indexing fell
>> over and hit integer overflows etc. on small datasets:
>>
>> https://github.com/apache/lucene/pull/11905
>>
>> Attacking me isn't helping the situation.
>>
>> PS: when I said "the one guy who wrote the code" I didn't mean it in
>> any kind of demeaning fashion, really. I meant to describe the current
>> state of usability with respect to indexing a few million docs with
>> high dimensions. You can scroll up the thread and see that at least
>> one other committer on the project experienced pain similar to mine.
>> Then, think about users who aren't committers trying to use the
>> functionality!
>>
>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com>
>> wrote:
>> >
>> > What you said about increasing dimensions requiring a bigger RAM
>> > buffer on merge is wrong. That's the point I was trying to make. Your
>> > concerns about merge costs are not wrong, but your conclusion that we
>> > need to limit dimensions is not justified.
>> >
>> > You complain that HNSW sucks and doesn't scale, but when I show that
>> > it scales linearly with dimension you just ignore that and complain
>> > about something entirely different.
>> >
>> > You demand that people run all kinds of tests to prove you wrong, but
>> > when they do, you don't listen, and you won't put in the work yourself,
>> > or you complain that it's too hard.
>> >
>> > Then you complain about people not meeting you halfway. Wow.
>> >
>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>> >>
>> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>> >> <michael.wech...@wyona.com> wrote:
>> >> >
>> >> > What exactly do you consider reasonable?
>> >>
>> >> Let's begin a real discussion by being HONEST about the current
>> >> status. Please put political correctness and your own company's
>> >> wishes aside; we know it's not in a good state.
>> >>
>> >> The current status is that the one guy who wrote the code can set a
>> >> multi-gigabyte RAM buffer and index a small dataset with 1024
>> >> dimensions in HOURS (I didn't ask what hardware).
>> >>
>> >> My concern is everyone else except the one guy; I want it to be
>> >> usable.
>> >> Increasing dimensions just means an even bigger RAM buffer and a
>> >> bigger heap to avoid OOM on merge.
>> >> It is also a permanent backwards-compatibility decision: we have to
>> >> support it once we do this, and we can't just say "oops" and flip it
>> >> back.
>> >>
>> >> It is unclear to me whether the multi-gigabyte RAM buffer is really
>> >> there to avoid merges because they are so slow that it would take
>> >> DAYS otherwise, or to avoid merges so that it doesn't hit OOM.
>> >> Also, from personal experience, it takes trial and error (meaning
>> >> experiencing OOM on merge!!!) before you get those heap values right
>> >> for your dataset. This usually means starting over, which is
>> >> frustrating and wastes more time.
>> >>
>> >> Jim mentioned some ideas about the memory usage in IndexWriter; that
>> >> seems like a good idea to me. Maybe the multi-gigabyte RAM buffer can
>> >> be avoided this way, and performance improved, by writing bigger
>> >> segments with Lucene's defaults. But this doesn't mean we can simply
>> >> ignore the horrors of what happens on merge. Merging needs to scale
>> >> so that indexing really scales.
>> >>
>> >> At the very least it shouldn't spike RAM on trivial amounts of data
>> >> and cause OOM, and it definitely shouldn't burn hours and hours of
>> >> CPU in O(n^2) fashion when indexing.
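For concreteness on "the limit is literally just raising an exception" above: the enforcement in question is an upfront argument check, roughly of the shape sketched below (an illustration of the idea, not Lucene's actual source). Removing or raising the limit means touching this one guard; the performance questions live elsewhere, in the HNSW writer and in merging.

    // Sketch of the kind of guard that enforces a dimension limit at
    // field-creation time. Removing the limit amounts to deleting or
    // relaxing this check; no indexing or search code path changes.
    static void checkDimension(int dimension, int maxDimensions) {
        if (dimension <= 0) {
            throw new IllegalArgumentException(
                "vector dimension must be positive, got " + dimension);
        }
        if (dimension > maxDimensions) {
            throw new IllegalArgumentException(
                "cannot index vectors with dimension > " + maxDimensions
                    + ", got " + dimension);
        }
    }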
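And the "multi-gigabyte RAM buffer" in Robert's message is IndexWriterConfig's flush threshold. A minimal sketch of the workaround being described, assuming Lucene 9.x; setRAMBufferSizeMB is the real knob, while the 4 GB figure, the index path, and the 1024-dimension vector are illustrative only, not recommendations:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class BigBufferVectorIndexing {
        public static void main(String[] args) throws Exception {
            // Flush very large segments so fewer merges happen. The JVM heap
            // (-Xmx) must be sized well above this buffer, or the flush/merge
            // is exactly where the OOM described above lands.
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
                .setRAMBufferSizeMB(4096);
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/vector-index")), config)) {
                float[] vector = new float[1024]; // at the current dimension limit
                Document doc = new Document();
                doc.add(new KnnVectorField("vec", vector,
                    VectorSimilarityFunction.EUCLIDEAN));
                writer.addDocument(doc);
            }
        }
    }

Finding a buffer and heap combination that survives merging is the trial-and-error process Robert describes; nothing in this sketch removes that cost, it only moves it around.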