I don't think this tone and language are appropriate for a community of volunteers and men of science.
I personally find it offensive to generalise the Lucene people here as "crazy people hyped about ChatGPT". I personally don't give a damn about ChatGPT, except for the fact that it is a very interesting technology. As usual, I see very little motivation and a lot of "convince me". We are discussing a limit that raises an exception. Improving performance is absolutely important, and no one here is saying we won't address it; it's just a separate discussion.

On Sun, 9 Apr 2023, 12:59 Robert Muir, <rcm...@gmail.com> wrote:

> Also, please let's only discuss SEARCH. Lucene is a SEARCH ENGINE LIBRARY, not a vector database or whatever trash is being proposed here.
>
> I think we should table this and revisit it after the ChatGPT hype has dissipated.
>
> This hype is causing people to behave irrationally; it is why I can't converse with basically anyone on this thread, because they are all stating crazy things that don't make sense.
>
> On Sun, Apr 9, 2023 at 6:25 AM Robert Muir <rcm...@gmail.com> wrote:
> >
> > Yes, it's very clear that folks on this thread are ignoring reason entirely and are completely swooned by ChatGPT hype.
> > And what happens when they make ChatGPT-8 and it uses even more dimensions?
> > Backwards-compatibility decisions can't be made on the basis of garbage hype such as cryptocurrency or ChatGPT.
> > Trying to convince me we should bump it because of ChatGPT, well, I think it has the opposite effect.
> >
> > Please, let me see real technical arguments for why this limit needs to be bumped, not including trash like ChatGPT.
> >
> > On Sat, Apr 8, 2023 at 7:50 PM Marcus Eagan <marcusea...@gmail.com> wrote:
> > >
> > > Given the massive amounts of funding going into the development and investigation of the project, I think it would be good to at least have Lucene be a part of the conversation. Simply because academics typically focus on vectors of <= 784 dimensions does not mean all users will.
> > > A large swathe of very important users of the Lucene project never exceed 500k documents, though they are shifting to other search engines to try out very popular embeddings.
> > >
> > > I think giving our users the opportunity to build chat bots or LLM memory machines using Lucene is a positive development, even if some datasets won't be able to work well. We don't limit the number of fields someone can add in most cases, though we did just undeprecate that API to better support multi-tenancy. But people still add so many fields and can crash their clusters with mapping explosions when unlimited. The limit on vectors feels similar. I expect more people to dig into Lucene, due to its openness and robustness, as they run into problems. Today, they are forced to consider other engines that are more permissive.
> > >
> > > Not every important or valuable Lucene workload is in the millions of documents. Many of them only have lots of queries or computationally expensive access patterns for B-trees. We can document that it is very ill-advised to make a deployment with vectors that are too large. What others do with it is on them.
> > >
> > > On Sat, Apr 8, 2023 at 2:29 PM Adrien Grand <jpou...@gmail.com> wrote:
> > >>
> > >> As Dawid pointed out earlier on this thread, this is the rule for Apache projects: a single -1 vote on a code change is a veto and cannot be overridden. Furthermore, Robert is one of the people on this project who has worked the most on debugging subtle bugs, making Lucene more robust, and improving our test framework, so I listen when he voices quality concerns.
> > >>
> > >> The argument against removing/raising the limit that resonates with me the most is that it is a one-way door. As MikeS highlighted earlier on this thread, implementations may want to take advantage of the fact that there is a limit at some point too.
> > >> This is why I don't want to remove the limit and would prefer a slight increase, such as 2048 as suggested in the original issue, which would enable most of the things that users who have been asking about raising the limit would like to do.
> > >>
> > >> I agree that the merge-time memory usage and slow indexing rate are not great. But it's still possible to index multi-million-vector datasets with a 4GB heap without hitting OOMEs, regardless of the number of dimensions, and the feedback I'm seeing is that many users are still interested in indexing multi-million-vector datasets despite the slow indexing rate. I wish we could do better, and vector indexing is certainly more expert than text indexing, but it still is usable in my opinion. I understand how giving Lucene more information about vectors prior to indexing (e.g. clustering information, as Jim pointed out) could help make merging faster and more memory-efficient, but I would really like to avoid making it a requirement for indexing vectors, as that also makes this feature much harder to use.
> > >>
> > >> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
> > >> >
> > >> > I am very attentive to listening to opinions, but I am unconvinced here, and I am not sure that a single person's opinion should be allowed to be detrimental to such an important project.
> > >> >
> > >> > The limit, as far as I know, is literally just raising an exception. Removing it won't alter in any way the current performance for users in low-dimensional spaces. Removing it will just enable more users to use Lucene.
> > >> >
> > >> > If new users in certain situations are unhappy with the performance, they may contribute improvements. This is how you make progress.
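The point above that the limit is "literally just raising an exception" can be sketched as a pure pre-flight check. This is a hedged illustration, not Lucene code: the real check lives in Lucene's codec/field setup, and the 1024 constant below mirrors the 9.x default cap discussed in this thread.

```python
# Illustrative only: the dimension cap as a standalone validation.
# MAX_DIMS mirrors Lucene 9.x's default of 1024 (an assumption here).
MAX_DIMS = 1024

def validate_dims(dims: int, max_dims: int = MAX_DIMS) -> None:
    """Reject over-limit vectors up front; no other code path changes."""
    if dims > max_dims:
        raise ValueError(
            f"vector dimension {dims} exceeds the configured maximum {max_dims}")
```

On this model, removing or raising the cap changes only this guard; users below the limit never touch it.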
> > >> > If it's a reputation thing, trust me that not allowing users to play with high-dimensional spaces will equally damage it.
> > >> >
> > >> > To me it's really a no-brainer. Removing the limit and enabling people to use high-dimensional vectors will take minutes. Improving the HNSW implementation can take months. Pick one to begin with...
> > >> >
> > >> > And there's no one paying me here, no company interest whatsoever; actually, I pay people to contribute. I am just convinced it's a good idea.
> > >> >
> > >> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
> > >> >>
> > >> >> I disagree with your categorization. I put in plenty of work and experienced plenty of pain myself, writing tests and fighting these issues, after I saw that, two releases in a row, vector indexing fell over and hit integer overflows etc. on small datasets:
> > >> >>
> > >> >> https://github.com/apache/lucene/pull/11905
> > >> >>
> > >> >> Attacking me isn't helping the situation.
> > >> >>
> > >> >> PS: when I said the "one guy who wrote the code", I didn't mean it in any kind of demeaning fashion, really. I meant to describe the current state of usability with respect to indexing a few million docs with high dimensions. You can scroll up the thread and see that at least one other committer on the project experienced similar pain as me. Then, think about users who aren't committers trying to use the functionality!
> > >> >>
> > >> >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
> > >> >> >
> > >> >> > What you said about increasing dimensions requiring a bigger RAM buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
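The "scales linearly with dimension" claim in this exchange can be put into a rough cost model: each distance computation does work proportional to the dimension, while the number of comparisons HNSW performs is governed by graph parameters (M, efConstruction), not by the dimension. The sketch below is an illustrative back-of-envelope model, not Lucene code, and the parameter values are arbitrary.

```python
def flops_per_distance(dims: int) -> int:
    # a dot-product or L2 distance does one multiply-add per component
    return 2 * dims

def build_flops(n_docs: int, comparisons_per_doc: int, dims: int) -> int:
    # comparisons_per_doc is set by the graph parameters, not by dims,
    # so total construction cost grows linearly in the dimension
    return n_docs * comparisons_per_doc * flops_per_distance(dims)

# doubling the dimension doubles the arithmetic, nothing worse
assert build_flops(1_000_000, 500, 1536) == 2 * build_flops(1_000_000, 500, 768)
```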
> > >> >> > You complain that HNSW sucks and doesn't scale, but when I show it scales linearly with dimension, you just ignore that and complain about something entirely different.
> > >> >> >
> > >> >> > You demand that people run all kinds of tests to prove you wrong, but when they do, you don't listen, and you won't put in the work yourself, or you complain that it's too hard.
> > >> >> >
> > >> >> > Then you complain about people not meeting you half way. Wow.
> > >> >> >
> > >> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
> > >> >> >>
> > >> >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner <michael.wech...@wyona.com> wrote:
> > >> >> >> >
> > >> >> >> > What exactly do you consider reasonable?
> > >> >> >>
> > >> >> >> Let's begin a real discussion by being HONEST about the current status. Please put political correctness or your own company's wishes aside; we know it's not in a good state.
> > >> >> >>
> > >> >> >> The current status is that the one guy who wrote the code can set a multi-gigabyte RAM buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware).
> > >> >> >>
> > >> >> >> My concern is everyone else except the one guy; I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte RAM buffer and a bigger heap to avoid OOM on merge. It is also a permanent backwards-compatibility decision; we have to support it once we do this, and we can't just say "oops" and flip it back.
> > >> >> >>
> > >> >> >> It is unclear to me whether the multi-gigabyte RAM buffer is really to avoid merges because they are so slow and it would be DAYS otherwise, or whether it's to avoid merges so it doesn't hit OOM.
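The multi-gigabyte RAM buffer concern above can be grounded with a back-of-envelope estimate of the raw float32 footprint alone. This is a lower bound: it ignores HNSW neighbor lists, doc ids, and JVM object overhead, and the example sizes are illustrative, not taken from anyone's benchmark in this thread.

```python
def raw_vector_bytes(n_docs: int, dims: int, bytes_per_float: int = 4) -> int:
    # float32 vectors held in the indexing buffer before a flush;
    # graph links and JVM overhead come on top of this
    return n_docs * dims * bytes_per_float

# 1M docs at 1536 dims: 6_144_000_000 bytes, roughly 5.7 GiB of raw
# floats before any graph structure is counted
one_m_high_dim = raw_vector_bytes(1_000_000, 1536)
```

Under this estimate, dimension count multiplies the buffer size directly, which is why both the "bigger buffer" worry and the "linear, not super-linear" rebuttal can be true at once.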
> > >> >> >> Also, from personal experience, it takes trial and error (meaning experiencing OOM on merge!!!) before you get those heap values correct for your dataset. This usually means starting over, which is frustrating and wastes more time.
> > >> >> >>
> > >> >> >> Jim mentioned some ideas about the memory usage in IndexWriter; seems to me like it's a good idea. Maybe the multi-gigabyte RAM buffer can be avoided in this way and performance improved by writing bigger segments with Lucene's defaults. But this doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales.
> > >> >> >>
> > >> >> >> At least it shouldn't spike RAM on trivial data amounts and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
> > >> >> >>
> > >> >> >> ---------------------------------------------------------------------
> > >> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > >> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> > >>
> > >> --
> > >> Adrien
> > >
> > > --
> > > Marcus Eagan
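The O(n^2) worry about merging voiced in this thread can be illustrated with a worst-case count. The model below assumes every flush triggers a merge that rebuilds the vector graph over all documents so far, with no incremental graph merging; real tiered merge policies amortize this, so it is an upper-bound sketch of the pathology, not a description of Lucene's actual merge behavior.

```python
def total_vectors_rebuilt(n_docs: int, flush_every: int) -> int:
    """Vectors re-processed if each flush re-merges one ever-growing
    segment (worst case). Grows like n^2 / flush_every."""
    total = 0
    merged = 0
    for _ in range(n_docs // flush_every):
        merged += flush_every
        total += merged  # this merge rebuilds the graph over all docs so far
    return total

# 1M docs flushed every 100k: 5.5M vector insertions instead of 1M
assert total_vectors_rebuilt(1_000_000, 100_000) == 5_500_000
```

This is also why a huge RAM buffer "helps": raising flush_every divides the rework, at the cost of the heap pressure described above.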