> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
We have plenty of text; the problem is coming up with a realistic vector model that requires as many dimensions as people seem to be demanding. As I said above, after surveying huggingface I couldn't find any text-based model using more than 768 dimensions. So far we have some ideas of generating higher-dimensional data by dithering or concatenating existing data, but it seems artificial.

On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>
> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>
> I am asking because I recently attended the following presentation by Anastassia Shaitarova (UZH Institute for Computational Linguistics, https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>
> ----
>
> Can we Identify Machine-Generated Text? An Overview of Current Approaches
> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>
> The detection of machine-generated text has become increasingly important due to the prevalence of automated content generation and its potential for misuse. In this talk, we will discuss the motivation for automatic detection of generated text. We will present the currently available methods, including feature-based classification as a "first line-of-defense." We will provide an overview of the detection tools that have been made available so far and discuss their limitations. Finally, we will reflect on some open problems associated with the automatic discrimination of generated texts.
>
> ----
>
> and her conclusion was that it has become basically impossible to differentiate between text generated by humans and text generated by, for example, ChatGPT.
>
> Whereas others have a slightly different opinion, see for example
>
> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>
> But I would argue that real-world and synthetic data have become close enough that testing performance and scalability of indexing should be possible with synthetic data.
>
> I completely agree that we have to base our discussions and decisions on scientific methods, and that we have to make sure that Lucene performs and scales well and that we understand the limits and what is going on under the hood.
>
> Thanks
>
> Michael W
>
> On 11.04.23 at 14:29, Michael McCandless wrote:
>
> +1 to test on real vector data -- if you test on synthetic data you draw synthetic conclusions.
>
> Can someone post the theoretical performance (CPU and RAM required) of HNSW construction? Do we know/believe our HNSW implementation has achieved that theoretical big-O performance? Maybe we have some silly performance bug that's causing it not to?
>
> As I understand it, HNSW makes the tradeoff of costly construction for faster searching, which is typically the right tradeoff for search use cases. We do this in other parts of the Lucene index too.
>
> Lucene will do a logarithmic number of merges over time, i.e. each doc will be merged O(log(N)) times in its lifetime in the index. We need to multiply that by the cost of re-building the whole HNSW graph on each merge. BTW, other things in Lucene, like BKD/dimensional points, also rebuild the whole data structure on each merge, I think? But, as Rob pointed out, stored fields merging does indeed do some sneaky tricks to avoid excessive block decompress/recompress on each merge.
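A back-of-envelope for the construction and merge cost asked about above, using the complexity claimed in the HNSW paper (Malkov & Yashunin) rather than any measurement of Lucene's implementation, with c lumping together the fanout (M) and beam-width (efConstruction) constants:

    Build cost for one graph of n vectors with dimension d:
        C_build(n, d)    ~= c * n * log(n) * d        (distance computations, each O(d))

    If each doc is merged O(log(N)) times and every merge rebuilds the graph from scratch:
        C_lifetime(N, d) ~= c * N * log^2(N) * d

    Graph RAM is roughly O(N * M) neighbor ids, independent of d;
    the raw float vectors are 4 * N * d bytes.

Whether our implementation actually achieves the claimed n log n build under its defaults is exactly the open measurement question.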
> > As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>
> Actually I think Robert's veto stands on its technical merit already. Robert's takes on technical matters very much resonate with me, even if he is sometimes prickly in how he expresses them ;)
>
> His point is that we, as a dev community, are not paying enough attention to the indexing performance of our KNN algo (HNSW) and implementation, and that it is reckless to increase / remove limits in that state. It is indeed a one-way-door decision and one must confront such decisions with caution, especially for such widely used base infrastructure as Lucene. We don't even advertise today in our javadocs that you need XXX heap if you index vectors with dimension Y, fanout X, levels Z, etc.
>
> RAM used during merging is unaffected by dimensionality, but is affected by fanout, because the HNSW graph (not the raw vectors) is memory resident, I think? Maybe we could move it off-heap and let the OS manage the memory (and still document the RAM requirements)? Maybe merge RAM costs should be accounted for in IW's RAM buffer accounting? It is not today, and there are some other things that use non-trivial RAM, e.g. the doc mapping (to compress docid space when deletions are reclaimed).
>
> When we added KNN vector testing to Lucene's nightly benchmarks, the indexing time massively increased -- see annotations DH and DP here: https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of course, that is using a single thread for indexing (on a box that has 128 cores!) so we produce a deterministic index every night ...
>
> Stepping out (meta) a bit ... this discussion is precisely one of the awesome benefits of the (informed) veto. It means risky changes to the software, as determined by any single informed developer on the project, can force a healthy discussion about the problem at hand. Robert is legitimately concerned about a real issue, and so we should use our creative energies to characterize our HNSW implementation's performance, document it clearly for users, and uncover ways to improve it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>
>> I think Gus's points are on target.
>>
>> I recommend we move this forward in this way: we stop any discussion, and everyone interested proposes an option with a motivation; then we aggregate the options and maybe create a vote?
>>
>> I am also on the same page on the fact that a veto should come with a clear and reasonable technical merit, which in my opinion has not come yet.
>>
>> I also apologise if any of my words sounded harsh or like personal attacks; I never meant them that way.
>>
>> My proposed option:
>>
>> 1) Remove the limit and potentially make it configurable.
>> Motivation: the system administrator can enforce a limit that their users need to respect, in line with whatever the admin decided is acceptable for them. The default can stay the current one.
>>
>> That's my favourite at the moment, but I agree that potentially in the future this may need to change, as we may optimise the data structures for certain dimensions. I am a big fan of YAGNI (you aren't gonna need it), so I am OK with facing a different discussion if that happens in the future.
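To make option 1 concrete, the cap itself could be as small as the sketch below. The class and property name are hypothetical (this is not an existing Lucene API); it only illustrates that the check is a guard evaluated before indexing, with the default left at today's value:

    // Hypothetical sketch of an admin-configurable dimension cap; not current Lucene API.
    public final class VectorDimensionLimit {

      // Default mirrors the 1024 limit discussed in this thread.
      public static final int DEFAULT_MAX_DIMENSIONS = 1024;

      private VectorDimensionLimit() {}

      // Resolve the effective limit from a (made-up) system property, falling back to the default.
      public static int effectiveLimit() {
        return Integer.getInteger("example.knn.maxDimensions", DEFAULT_MAX_DIMENSIONS);
      }

      // Validate a vector before indexing; rejects it the same way the hard-coded limit does today.
      public static void check(float[] vector) {
        int limit = effectiveLimit();
        if (vector.length > limit) {
          throw new IllegalArgumentException(
              "vector has " + vector.length + " dimensions, which exceeds the configured limit of " + limit);
        }
      }
    }

The guard itself doesn't change any data structure; the real question, as discussed above, is what HNSW then costs when users actually index the larger vectors.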
>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com> wrote:
>>>
>>> What I see so far:
>>>
>>> - Much positive support for raising the limit
>>> - Slightly less support for removing it or making it configurable
>>> - A single veto which argues that a (as yet undefined) performance standard must be met before raising the limit
>>> - Hot tempers (various) making this discussion difficult
>>>
>>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>>
>>> - No standard for the performance is given, so it cannot be technically met. Without hard criteria it's a moving target.
>>> - It appears to encode a valuation of the user's time, and that valuation is really up to the user. Some users may consider 2 hours useless and not worth it, and others might happily wait 2 hours. This is not a technical decision, it's a business decision regarding the relative value of the time invested vs the value of the result. If I can cure cancer by indexing for a year, that might be worth it... (hyperbole of course).
>>>
>>> Things I would consider to have technical merit that I don't hear:
>>>
>>> - Impact on the speed of **other** indexing operations (devaluation of other functionality).
>>> - Actual scenarios that work when the limit is low and fail when the limit is high (new failure on the same data with the limit raised).
>>>
>>> One thing that might or might not have technical merit:
>>>
>>> - If someone feels there is a lack of documentation of the costs/performance implications of using large vectors, possibly including reproducible benchmarks establishing the scaling behavior (there seems to be disagreement on O(n) vs O(n^2)).
>>>
>>> The users *should* know what they are getting into, but if the cost is worth it to them, they should be able to pay it without forking the project. If this veto causes a fork, that's not good.
>>>
>>> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <msoko...@gmail.com> wrote:
>>>>
>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100 and 300 dimensional varieties and can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that, and I believe the performance numbers would be plausible.
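A sketch of what that concatenation could look like, with a little noise ("dithering", as mentioned earlier in the thread) so the copies are not perfectly correlated. The helper is hypothetical and not part of luceneutil; concatenating four copies of a 300-dimensional vector yields a 1200-dimensional test vector:

    import java.util.Random;

    // Hypothetical helper for synthesizing higher-dimensional test vectors from an
    // existing embedding by tiling it and adding small Gaussian noise. Illustrative only.
    public final class SyntheticVectors {

      private SyntheticVectors() {}

      public static float[] concatWithDither(float[] base, int copies, float noiseScale, long seed) {
        Random random = new Random(seed);
        float[] result = new float[base.length * copies];
        for (int c = 0; c < copies; c++) {
          for (int i = 0; i < base.length; i++) {
            // copy the source component and perturb it slightly so the copies are not identical
            result[c * base.length + i] = base[i] + (float) (random.nextGaussian() * noiseScale);
          }
        }
        return result;
      }
    }

As noted above, the result is still artificial: the extra dimensions carry almost no new information, which is exactly the caveat about drawing synthetic conclusions from synthetic data.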
>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>>>>>
>>>>> Can we set up a branch in which the limit is bumped to 2048, then have a realistic, free data set (wikipedia sample or something) that has, say, 5 million docs and vectors created using public data (glove pre-trained embeddings or the like)? We then could run indexing on the same hardware with 512, 1024 and 2048 and see what the numbers, limits and behavior actually are.
>>>>>
>>>>> I can help in writing this but not until after Easter.
>>>>>
>>>>> Dawid
>>>>>
>>>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> wrote:
>>>>> >
>>>>> > As Dawid pointed out earlier on this thread, this is the rule for Apache projects: a single -1 vote on a code change is a veto and cannot be overridden. Furthermore, Robert is one of the people on this project who worked the most on debugging subtle bugs, making Lucene more robust and improving our test framework, so I'm listening when he voices quality concerns.
>>>>> >
>>>>> > The argument against removing/raising the limit that resonates with me the most is that it is a one-way door. As MikeS highlighted earlier on this thread, implementations may want to take advantage of the fact that there is a limit at some point too. This is why I don't want to remove the limit and would prefer a slight increase, such as 2048 as suggested in the original issue, which would enable most of the things that users who have been asking about raising the limit would like to do.
>>>>> >
>>>>> > I agree that the merge-time memory usage and slow indexing rate are not great. But it's still possible to index multi-million vector datasets with a 4GB heap without hitting OOMEs regardless of the number of dimensions, and the feedback I'm seeing is that many users are still interested in indexing multi-million vector datasets despite the slow indexing rate. I wish we could do better, and vector indexing is certainly more expert than text indexing, but it still is usable in my opinion. I understand how giving Lucene more information about vectors prior to indexing (e.g. clustering information as Jim pointed out) could help make merging faster and more memory-efficient, but I would really like to avoid making it a requirement for indexing vectors as it also makes this feature much harder to use.
>>>>> >
>>>>> > On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>>>> > >
>>>>> > > I am very attentive to listening to opinions, but I am unconvinced here, and I am not sure that a single person's opinion should be allowed to be detrimental to such an important project.
>>>>> > >
>>>>> > > The limit, as far as I know, is literally just raising an exception. Removing it won't alter in any way the current performance for users in low dimensional space. Removing it will just enable more users to use Lucene.
>>>>> > >
>>>>> > > If new users in certain situations are unhappy with the performance, they may contribute improvements. This is how you make progress.
>>>>> > >
>>>>> > > If it's a reputation thing, trust me that not allowing users to play with high dimensional space will equally damage it.
>>>>> > >
>>>>> > > To me it's really a no-brainer. Removing the limit and enabling people to use high dimensional vectors will take minutes. Improving the HNSW implementation can take months. Pick one to begin with...
>>>>> > >
>>>>> > > And there's no one paying me here, no company interest whatsoever; actually I pay people to contribute. I am just convinced it's a good idea.
>>>>> > >
>>>>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>>>>> > >>
>>>>> > >> I disagree with your categorization. I put in plenty of work and experienced plenty of pain myself, writing tests and fighting these issues, after I saw that, two releases in a row, vector indexing fell over and hit integer overflows etc. on small datasets:
>>>>> > >>
>>>>> > >> https://github.com/apache/lucene/pull/11905
>>>>> > >>
>>>>> > >> Attacking me isn't helping the situation.
>>>>> > >>
>>>>> > >> PS: when I said the "one guy who wrote the code" I didn't mean it in any kind of demeaning fashion really.
>>>>> > >> I meant to describe the current state of usability with respect to indexing a few million docs with high dimensions. You can scroll up the thread and see that at least one other committer on the project experienced similar pain as me. Then, think about users who aren't committers trying to use the functionality!
>>>>> > >>
>>>>> > >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
>>>>> > >> >
>>>>> > >> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>>>> > >> >
>>>>> > >> > You complain that HNSW sucks and doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>>>> > >> >
>>>>> > >> > You demand that people run all kinds of tests to prove you wrong, but when they do, you don't listen, and you won't put in the work yourself, or you complain that it's too hard.
>>>>> > >> >
>>>>> > >> > Then you complain about people not meeting you halfway. Wow.
>>>>> > >> >
>>>>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>>>>> > >> >>
>>>>> > >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>>>> > >> >> >
>>>>> > >> >> > What exactly do you consider reasonable?
>>>>> > >> >>
>>>>> > >> >> Let's begin a real discussion by being HONEST about the current status. Please put political correctness or your own company's wishes aside; we know it's not in a good state.
>>>>> > >> >>
>>>>> > >> >> Current status is that the one guy who wrote the code can set a multi-gigabyte ram buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware).
>>>>> > >> >>
>>>>> > >> >> My concern is everyone else except the one guy; I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte ram buffer and a bigger heap to avoid OOM on merge. It is also a permanent backwards compatibility decision: we have to support it once we do this, and we can't just say "oops" and flip it back.
>>>>> > >> >>
>>>>> > >> >> It is unclear to me if the multi-gigabyte ram buffer is really to avoid merges because they are so slow and it would be DAYS otherwise, or if it's to avoid merges so it doesn't hit OOM. Also, from personal experience, it takes trial and error (meaning experiencing OOM on merge!!!) before you get those heap values correct for your dataset. This usually means starting over, which is frustrating and wastes more time.
>>>>> > >> >>
>>>>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter; that seems to me like a good idea. Maybe the multi-gigabyte ram buffer can be avoided in this way and performance improved by writing bigger segments with Lucene's defaults. But this doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales.
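For context, the "multi-gigabyte ram buffer" being discussed here is just an IndexWriter setting; a minimal sketch of that setup follows (path, buffer size and dimension are illustrative values, not a recommendation):

    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class BigBufferVectorIndexing {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig();
        // Multi-gigabyte buffer so vector docs flush as few, large segments
        // (and therefore fewer HNSW graph rebuilds at merge time).
        iwc.setRAMBufferSizeMB(4096);
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/vector-index"));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
          float[] vector = new float[1024]; // 1024 dimensions, the current hard limit
          // ... fill the vector from an embedding model ...
          Document doc = new Document();
          doc.add(new KnnVectorField("vec", vector)); // defaults to Euclidean similarity
          writer.addDocument(doc);
        }
      }
    }

The trial-and-error part is that nothing in this config tells you how much heap the merge-time HNSW graph will need, which is the documentation gap Mike McCandless pointed at earlier in the thread.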
>>>>> > >> >> At least it shouldn't spike RAM on trivial data amounts and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
>>>>> >
>>>>> > --
>>>>> > Adrien
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)