> What exactly do you consider real vector data? Vector data which is based on 
> texts written by humans?

We have plenty of text; the problem is coming up with a realistic
vector model that requires as many dimensions as people seem to be
demanding. As I said above, after surveying huggingface I couldn't
find any text-based model using more than 768 dimensions. So far we
have some ideas of generating higher-dimensional data by dithering or
concatenating existing data, but it seems artificial.

On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner
<michael.wech...@wyona.com> wrote:
>
> What exactly do you consider real vector data? Vector data which is based on 
> texts written by humans?
>
> I am asking, because I recently attended the following presentation by 
> Anastassia Shaitarova (UZH Institute for Computational Linguistics, 
> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>
> ----
>
> Can we Identify Machine-Generated Text? An Overview of Current Approaches
> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>
> The detection of machine-generated text has become increasingly important due 
> to the prevalence of automated content generation and its potential for 
> misuse. In this talk, we will discuss the motivation for automatic detection 
> of generated text. We will present the currently available methods, including 
> feature-based classification as a “first line-of-defense.” We will provide an 
> overview of the detection tools that have been made available so far and 
> discuss their limitations. Finally, we will reflect on some open problems 
> associated with the automatic discrimination of generated texts.
>
> ----
>
> and her conclusion was that it has become basically impossible to 
> differentiate between text generated by humans and text generated by for 
> example ChatGPT.
>
> Whereas others have a slightly different opinion, see for example
>
> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>
> But I would argue that real world and synthetic have become close enough that 
> testing performance and scalability of indexing should be possible with 
> synthetic data.
>
> I completely agree that we have to base our discussions and decisions on 
> scientific methods and that we have to make sure that Lucene performs and 
> scales well and that we understand the limits and what is going on under the 
> hood.
>
> Thanks
>
> Michael W
>
>
>
>
>
> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>
> +1 to test on real vector data -- if you test on synthetic data you draw 
> synthetic conclusions.
>
> Can someone post the theoretical performance (CPU and RAM required) of HNSW 
> construction?  Do we know/believe our HNSW implementation has achieved that 
> theoretical big-O performance?  Maybe we have some silly performance bug 
> that's causing it not to?
>
> As I understand it, HNSW makes the tradeoff of costly construction for faster 
> searching, which is typically the right tradeoff for search use cases.  We do 
> this in other parts of the Lucene index too.
>
> Lucene will do a logarithmic number of merges over time, i.e. each doc will 
> be merged O(log(N)) times in its lifetime in the index.  We need to multiply 
> that by the cost of re-building the whole HNSW graph on each merge.  BTW, 
> other things in Lucene, like BKD/dimensional points, also rebuild the whole 
> data structure on each merge, I think?  But, as Rob pointed out, stored 
> fields merging do indeed do some sneaky tricks to avoid excessive block 
> decompress/recompress on each merge.
>
> > As I understand it, vetoes must have technical merit. I'm not sure that 
> > this veto rises to "technical merit" on 2 counts:
>
> Actually I think Robert's veto stands on its technical merit already.  
> Robert's take on technical matters very much resonate with me, even if he is 
> sometimes prickly in how he expresses them ;)
>
> His point is that we, as a dev community, are not paying enough attention to 
> the indexing performance of our KNN algo (HNSW) and implementation, and that 
> it is reckless to increase / remove limits in that state.  It is indeed a 
> one-way door decision and one must confront such decisions with caution, 
> especially for such a widely used base infrastructure as Lucene.  We don't 
> even advertise today in our javadocs that you need XXX heap if you index 
> vectors with dimension Y, fanout X, levels Z, etc.
>
> RAM used during merging is unaffected by dimensionality, but is affected by 
> fanout, because the HNSW graph (not the raw vectors) is memory resident, I 
> think?  Maybe we could move it off-heap and let the OS manage the memory (and 
> still document the RAM requirements)?  Maybe merge RAM costs should be 
> accounted for in IW's RAM buffer accounting?  It is not today, and there are 
> some other things that use non-trivial RAM, e.g. the doc mapping (to compress 
> docid space when deletions are reclaimed).
>
> When we added KNN vector testing to Lucene's nightly benchmarks, the indexing 
> time massively increased -- see annotations DH and DP here: 
> https://home.apache.org/~mikemccand/lucenebench/indexing.html.  Nightly 
> benchmarks now start at 6 PM and don't finish until ~14.5 hours later.  Of 
> course, that is using a single thread for indexing (on a box that has 128 
> cores!) so we produce a deterministic index every night ...
>
> Stepping out (meta) a bit ... this discussion is precisely one of the awesome 
> benefits of the (informed) veto.  It means risky changes to the software, as 
> determined by any single informed developer on the project, can force a 
> healthy discussion about the problem at hand.  Robert is legitimately 
> concerned about a real issue and so we should use our creative energies to 
> characterize our HNSW implementation's performance, document it clearly for 
> users, and uncover ways to improve it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti <a.benede...@sease.io> 
> wrote:
>>
>> I think Gus points are on target.
>>
>> I recommend we move this forward in this way:
>> We stop any discussion and everyone interested proposes an option with a 
>> motivation, then we aggregate the options and we create a Vote maybe?
>>
>> I am also on the same page on the fact that a veto should come with a clear 
>> and reasonable technical merit, which also in my opinion has not come yet.
>>
>> I also apologise if any of my words sounded harsh or personal attacks, never 
>> meant to do so.
>>
>> My proposed option:
>>
>> 1) remove the limit and potentially make it configurable,
>> Motivation:
>> The system administrator can enforce a limit its users need to respect that 
>> it's in line with whatever the admin decided to be acceptable for them.
>> Default can stay the current one.
>>
>> That's my favourite at the moment, but I agree that potentially in the 
>> future this may need to change, as we may optimise the data structures for 
>> certain dimensions. I  am a big fan of Yagni (you aren't going to need it) 
>> so I am ok we'll face a different discussion if that happens in the future.
>>
>>
>>
>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com> wrote:
>>>
>>> What I see so far:
>>>
>>> Much positive support for raising the limit
>>> Slightly less support for removing it or making it configurable
>>> A single veto which argues that a (as yet undefined) performance standard 
>>> must be met before raising the limit
>>> Hot tempers (various) making this discussion difficult
>>>
>>> As I understand it, vetoes must have technical merit. I'm not sure that 
>>> this veto rises to "technical merit" on 2 counts:
>>>
>>> No standard for the performance is given so it cannot be technically met. 
>>> Without hard criteria it's a moving target.
>>> It appears to encode a valuation of the user's time, and that valuation is 
>>> really up to the user. Some users may consider 2hours useless and not worth 
>>> it, and others might happily wait 2 hours. This is not a technical 
>>> decision, it's a business decision regarding the relative value of the time 
>>> invested vs the value of the result. If I can cure cancer by indexing for a 
>>> year, that might be worth it... (hyperbole of course).
>>>
>>> Things I would consider to have technical merit that I don't hear:
>>>
>>> Impact on the speed of **other** indexing operations. (devaluation of other 
>>> functionality)
>>> Actual scenarios that work when the limit is low and fail when the limit is 
>>> high (new failure on the same data with the limit raised).
>>>
>>> One thing that might or might not have technical merit
>>>
>>> If someone feels there is a lack of documentation of the costs/performance 
>>> implications of using large vectors, possibly including reproducible 
>>> benchmarks establishing the scaling behavior (there seems to be 
>>> disagreement on O(n) vs O(n^2)).
>>>
>>> The users *should* know what they are getting into, but if the cost is 
>>> worth it to them, they should be able to pay it without forking the 
>>> project. If this veto causes a fork that's not good.
>>>
>>> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <msoko...@gmail.com> wrote:
>>>>
>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100 
>>>> and 300 dimensional varieties and can easily enough generate large numbers 
>>>> of vector documents from the articles data. To go higher we could 
>>>> concatenate vectors from that and I believe the performance numbers would 
>>>> be plausible.
>>>>
>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>>>>>
>>>>> Can we set up a branch in which the limit is bumped to 2048, then have
>>>>> a realistic, free data set (wikipedia sample or something) that has,
>>>>> say, 5 million docs and vectors created using public data (glove
>>>>> pre-trained embeddings or the like)? We then could run indexing on the
>>>>> same hardware with 512, 1024 and 2048 and see what the numbers, limits
>>>>> and behavior actually are.
>>>>>
>>>>> I can help in writing this but not until after Easter.
>>>>>
>>>>>
>>>>> Dawid
>>>>>
>>>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> wrote:
>>>>> >
>>>>> > As Dawid pointed out earlier on this thread, this is the rule for
>>>>> > Apache projects: a single -1 vote on a code change is a veto and
>>>>> > cannot be overridden. Furthermore, Robert is one of the people on this
>>>>> > project who worked the most on debugging subtle bugs, making Lucene
>>>>> > more robust and improving our test framework, so I'm listening when he
>>>>> > voices quality concerns.
>>>>> >
>>>>> > The argument against removing/raising the limit that resonates with me
>>>>> > the most is that it is a one-way door. As MikeS highlighted earlier on
>>>>> > this thread, implementations may want to take advantage of the fact
>>>>> > that there is a limit at some point too. This is why I don't want to
>>>>> > remove the limit and would prefer a slight increase, such as 2048 as
>>>>> > suggested in the original issue, which would enable most of the things
>>>>> > that users who have been asking about raising the limit would like to
>>>>> > do.
>>>>> >
>>>>> > I agree that the merge-time memory usage and slow indexing rate are
>>>>> > not great. But it's still possible to index multi-million vector
>>>>> > datasets with a 4GB heap without hitting OOMEs regardless of the
>>>>> > number of dimensions, and the feedback I'm seeing is that many users
>>>>> > are still interested in indexing multi-million vector datasets despite
>>>>> > the slow indexing rate. I wish we could do better, and vector indexing
>>>>> > is certainly more expert than text indexing, but it still is usable in
>>>>> > my opinion. I understand how giving Lucene more information about
>>>>> > vectors prior to indexing (e.g. clustering information as Jim pointed
>>>>> > out) could help make merging faster and more memory-efficient, but I
>>>>> > would really like to avoid making it a requirement for indexing
>>>>> > vectors as it also makes this feature much harder to use.
>>>>> >
>>>>> > On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
>>>>> > <a.benede...@sease.io> wrote:
>>>>> > >
>>>>> > > I am very attentive to listen opinions but I am un-convinced here and 
>>>>> > > I an not sure that a single person opinion should be allowed to be 
>>>>> > > detrimental for such an important project.
>>>>> > >
>>>>> > > The limit as far as I know is literally just raising an exception.
>>>>> > > Removing it won't alter in any way the current performance for users 
>>>>> > > in low dimensional space.
>>>>> > > Removing it will just enable more users to use Lucene.
>>>>> > >
>>>>> > > If new users in certain situations will be unhappy with the 
>>>>> > > performance, they may contribute improvements.
>>>>> > > This is how you make progress.
>>>>> > >
>>>>> > > If it's a reputation thing, trust me that not allowing users to play 
>>>>> > > with high dimensional space will equally damage it.
>>>>> > >
>>>>> > > To me it's really a no brainer.
>>>>> > > Removing the limit and enable people to use high dimensional vectors 
>>>>> > > will take minutes.
>>>>> > > Improving the hnsw implementation can take months.
>>>>> > > Pick one to begin with...
>>>>> > >
>>>>> > > And there's no-one paying me here, no company interest whatsoever, 
>>>>> > > actually I pay people to contribute, I am just convinced it's a good 
>>>>> > > idea.
>>>>> > >
>>>>> > >
>>>>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>>>>> > >>
>>>>> > >> I disagree with your categorization. I put in plenty of work and
>>>>> > >> experienced plenty of pain myself, writing tests and fighting these
>>>>> > >> issues, after i saw that, two releases in a row, vector indexing fell
>>>>> > >> over and hit integer overflows etc on small datasets:
>>>>> > >>
>>>>> > >> https://github.com/apache/lucene/pull/11905
>>>>> > >>
>>>>> > >> Attacking me isn't helping the situation.
>>>>> > >>
>>>>> > >> PS: when i said the "one guy who wrote the code" I didn't mean it in
>>>>> > >> any kind of demeaning fashion really. I meant to describe the current
>>>>> > >> state of usability with respect to indexing a few million docs with
>>>>> > >> high dimensions. You can scroll up the thread and see that at least
>>>>> > >> one other committer on the project experienced similar pain as me.
>>>>> > >> Then, think about users who aren't committers trying to use the
>>>>> > >> functionality!
>>>>> > >>
>>>>> > >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> 
>>>>> > >> wrote:
>>>>> > >> >
>>>>> > >> > What you said about increasing dimensions requiring a bigger ram 
>>>>> > >> > buffer on merge is wrong. That's the point I was trying to make. 
>>>>> > >> > Your concerns about merge costs are not wrong, but your conclusion 
>>>>> > >> > that we need to limit dimensions is not justified.
>>>>> > >> >
>>>>> > >> > You complain that hnsw sucks it doesn't scale, but when I show it 
>>>>> > >> > scales linearly with dimension you just ignore that and complain 
>>>>> > >> > about something entirely different.
>>>>> > >> >
>>>>> > >> > You demand that people run all kinds of tests to prove you wrong 
>>>>> > >> > but when they do, you don't listen and you won't put in the work 
>>>>> > >> > yourself or complain that it's too hard.
>>>>> > >> >
>>>>> > >> > Then you complain about people not meeting you half way. Wow
>>>>> > >> >
>>>>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>>>>> > >> >>
>>>>> > >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>>>>> > >> >> <michael.wech...@wyona.com> wrote:
>>>>> > >> >> >
>>>>> > >> >> > What exactly do you consider reasonable?
>>>>> > >> >>
>>>>> > >> >> Let's begin a real discussion by being HONEST about the current
>>>>> > >> >> status. Please put politically correct or your own company's 
>>>>> > >> >> wishes
>>>>> > >> >> aside, we know it's not in a good state.
>>>>> > >> >>
>>>>> > >> >> Current status is the one guy who wrote the code can set a
>>>>> > >> >> multi-gigabyte ram buffer and index a small dataset with 1024
>>>>> > >> >> dimensions in HOURS (i didn't ask what hardware).
>>>>> > >> >>
>>>>> > >> >> My concerns are everyone else except the one guy, I want it to be
>>>>> > >> >> usable. Increasing dimensions just means even bigger 
>>>>> > >> >> multi-gigabyte
>>>>> > >> >> ram buffer and bigger heap to avoid OOM on merge.
>>>>> > >> >> It is also a permanent backwards compatibility decision, we have 
>>>>> > >> >> to
>>>>> > >> >> support it once we do this and we can't just say "oops" and flip 
>>>>> > >> >> it
>>>>> > >> >> back.
>>>>> > >> >>
>>>>> > >> >> It is unclear to me, if the multi-gigabyte ram buffer is really to
>>>>> > >> >> avoid merges because they are so slow and it would be DAYS 
>>>>> > >> >> otherwise,
>>>>> > >> >> or if its to avoid merges so it doesn't hit OOM.
>>>>> > >> >> Also from personal experience, it takes trial and error (means
>>>>> > >> >> experiencing OOM on merge!!!) before you get those heap values 
>>>>> > >> >> correct
>>>>> > >> >> for your dataset. This usually means starting over which is
>>>>> > >> >> frustrating and wastes more time.
>>>>> > >> >>
>>>>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter, 
>>>>> > >> >> seems
>>>>> > >> >> to me like its a good idea. maybe the multigigabyte ram buffer 
>>>>> > >> >> can be
>>>>> > >> >> avoided in this way and performance improved by writing bigger
>>>>> > >> >> segments with lucene's defaults. But this doesn't mean we can 
>>>>> > >> >> simply
>>>>> > >> >> ignore the horrors of what happens on merge. merging needs to 
>>>>> > >> >> scale so
>>>>> > >> >> that indexing really scales.
>>>>> > >> >>
>>>>> > >> >> At least it shouldnt spike RAM on trivial data amounts and cause 
>>>>> > >> >> OOM,
>>>>> > >> >> and definitely it shouldnt burn hours and hours of CPU in O(n^2)
>>>>> > >> >> fashion when indexing.
>>>>> > >> >>
>>>>> > >> >> ---------------------------------------------------------------------
>>>>> > >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>> > >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>> > >> >>
>>>>> > >>
>>>>> > >> ---------------------------------------------------------------------
>>>>> > >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>> > >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>> > >>
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Adrien
>>>>> >
>>>>> > ---------------------------------------------------------------------
>>>>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>> >
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>>
>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to