What exactly do you consider real vector data? Vector data which is based on texts written by humans?

I am asking because I recently attended the following presentation by Anastassia Shaitarova (UZH Institute for Computational Linguistics, https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html):

----

*Can we Identify Machine-Generated Text? An Overview of Current Approaches*
by Anastassia Shaitarova (UZH Institute for Computational Linguistics)

/The detection of machine-generated text has become increasingly important due to the prevalence of automated content generation and its potential for misuse. In this talk, we will discuss the motivation for automatic detection of generated text. We will present the currently available methods, including feature-based classification as a “first line-of-defense.” We will provide an overview of the detection tools that have been made available so far and discuss their limitations. Finally, we will reflect on some open problems associated with the automatic discrimination of generated texts./

----

and her conclusion was that it has become basically impossible to differentiate between text written by humans and text generated by, for example, ChatGPT.

Others have a slightly different opinion; see for example

https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/

But I would argue that real-world and synthetic data have become close enough that testing the performance and scalability of indexing should be possible with synthetic data.

I completely agree that we have to base our discussions and decisions on scientific methods, make sure that Lucene performs and scales well, and understand the limits and what is going on under the hood.

Thanks

Michael W





On 11.04.23 at 14:29, Michael McCandless wrote:
+1 to test on real vector data -- if you test on synthetic data you draw synthetic conclusions.

Can someone post the theoretical performance (CPU and RAM required) of HNSW construction?  Do we know/believe our HNSW implementation has achieved that theoretical big-O performance?  Maybe we have some silly performance bug that's causing it not to?

As I understand it, HNSW makes the tradeoff of costly construction for faster searching, which is typically the right tradeoff for search use cases.  We do this in other parts of the Lucene index too.

Lucene will do a logarithmic number of merges over time, i.e. each doc will be merged O(log(N)) times in its lifetime in the index.  We need to multiply that by the cost of re-building the whole HNSW graph on each merge.  BTW, other things in Lucene, like BKD/dimensional points, also rebuild the whole data structure on each merge, I think?  But, as Rob pointed out, stored fields merging does indeed do some sneaky tricks to avoid excessive block decompress/recompress on each merge.
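
To make that concrete, here is a rough back-of-envelope sketch (the merge model, the parameter names and the log-based formula are simplifying assumptions on my part, not measured Lucene behavior):

    // Very rough estimate of how many times vectors get re-inserted into an
    // HNSW graph over the lifetime of an index, assuming each doc is merged
    // O(log(N)) times and every merge rebuilds the graph from scratch.
    static long estimatedGraphInsertions(
        long totalDocs, long docsPerFlushedSegment, int mergeFactor) {
      double mergesPerDoc =
          Math.log((double) totalDocs / docsPerFlushedSegment) / Math.log(mergeFactor);
      return (long) (totalDocs * Math.max(1, mergesPerDoc));
    }
    // Each insertion visits roughly beamWidth candidates, and each visit is a
    // distance computation over `dims` floats, so under these assumptions the
    // total merge-time CPU grows with
    // estimatedGraphInsertions(...) * beamWidth * dims.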

> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:

Actually I think Robert's veto stands on its technical merit already.  Robert's takes on technical matters very much resonate with me, even if he is sometimes prickly in how he expresses them ;)

His point is that we, as a dev community, are not paying enough attention to the indexing performance of our KNN algo (HNSW) and implementation, and that it is reckless to increase / remove limits in that state.  It is indeed a one-way door decision and one must confront such decisions with caution, especially for such a widely used base infrastructure as Lucene.  We don't even advertise today in our javadocs that you need XXX heap if you index vectors with dimension Y, fanout X, levels Z, etc.

RAM used during merging is unaffected by dimensionality, but is affected by fanout, because the HNSW graph (not the raw vectors) is memory resident, I think?  Maybe we could move it off-heap and let the OS manage the memory (and still document the RAM requirements)?  Maybe merge RAM costs should be accounted for in IW's RAM buffer accounting?  It is not today, and there are some other things that use non-trivial RAM, e.g. the doc mapping (to compress docid space when deletions are reclaimed).
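
As a concrete reference for the "multi-gigabyte RAM buffer" workaround mentioned further down the thread, this is the knob people are turning by hand today (a minimal sketch; the 2048 MB value and the index path are arbitrary examples, not recommendations):

    import java.nio.file.Paths;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class BigRamBufferExample {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig();
        // Flush larger segments so fewer merges (and HNSW graph rebuilds) happen.
        // 2048 MB is an arbitrary illustration; the default is 16 MB.
        iwc.setRAMBufferSizeMB(2048);
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/vector-index"));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
          // ... add documents containing vector fields here ...
          writer.commit();
        }
      }
    }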

When we added KNN vector testing to Lucene's nightly benchmarks, the indexing time massively increased -- see annotations DH and DP here: https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly benchmarks now start at 6 PM and don't finish until ~14.5 hours later.  Of course, that is using a single thread for indexing (on a box that has 128 cores!) so we produce a deterministic index every night ...

Stepping out (meta) a bit ... this discussion is precisely one of the awesome benefits of the (informed) veto.  It means risky changes to the software, as determined by any single informed developer on the project, can force a healthy discussion about the problem at hand.  Robert is legitimately concerned about a real issue and so we should use our creative energies to characterize our HNSW implementation's performance, document it clearly for users, and uncover ways to improve it.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti <a.benede...@sease.io> wrote:

    I think Gus's points are on target.

    I recommend we move this forward in this way:
    we stop the discussion, everyone interested proposes an option
    with a motivation, then we aggregate the options and maybe
    create a vote?

    I am also on the same page that a veto should come with clear
    and reasonable technical merit, which in my opinion has not
    come yet.

    I also apologise if any of my words sounded harsh or like
    personal attacks; I never meant that.

    My proposed option:

    1) Remove the limit and potentially make it configurable.
    Motivation:
    The system administrator can enforce a limit that their users
    need to respect, in line with whatever the admin decided is
    acceptable for them.
    The default can stay the current one. (A rough sketch of what
    this could look like is further below.)

    That's my favourite at the moment, but I agree that potentially in
    the future this may need to change, as we may optimise the data
    structures for certain dimensions. I am a big fan of YAGNI (you
    aren't going to need it), so I am OK with facing a different
    discussion if that happens in the future.
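
    A purely hypothetical sketch of what such a configurable guard could
    look like (the class name and the system property are invented for
    illustration; this is not Lucene's actual API):

        // Hypothetical guard: keep today's default as the limit, but let an
        // operator raise it explicitly via a system property.
        public final class VectorDimensionGuard {
          private static final int DEFAULT_MAX_DIMENSIONS = 1024;
          private static final int MAX_DIMENSIONS =
              Integer.getInteger("org.example.knn.maxDimensions", DEFAULT_MAX_DIMENSIONS);

          private VectorDimensionGuard() {}

          public static void check(int dimension) {
            if (dimension <= 0 || dimension > MAX_DIMENSIONS) {
              throw new IllegalArgumentException(
                  "vector dimension must be in [1, " + MAX_DIMENSIONS + "], got: " + dimension);
            }
          }
        }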



    On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com> wrote:

        What I see so far:

         1. Much positive support for raising the limit
         2. Slightly less support for removing it or making it
            configurable
         3. A single veto which argues that an (as yet undefined)
            performance standard must be met before raising the limit
         4. Hot tempers (various) making this discussion difficult

        As I understand it, vetoes must have technical merit. I'm not
        sure that this veto rises to "technical merit" on 2 counts:

         1. No standard for the performance is given so it cannot be
            technically met. Without hard criteria it's a moving target.
         2. It appears to encode a valuation of the user's time, and
            that valuation is really up to the user. Some users may
            consider 2 hours useless and not worth it, and others might
            happily wait 2 hours. This is not a technical decision,
            it's a business decision regarding the relative value of
            the time invested vs the value of the result. If I can
            cure cancer by indexing for a year, that might be worth
            it... (hyperbole of course).

        Things I would consider to have technical merit that I don't hear:

         1. Impact on the speed of **other** indexing operations.
            (devaluation of other functionality)
         2. Actual scenarios that work when the limit is low and fail
            when the limit is high (new failure on the same data with
            the limit raised).

        One thing that might or might not have technical merit

         1. If someone feels there is a lack of documentation of the
            costs/performance implications of using large vectors,
            possibly including reproducible benchmarks establishing
            the scaling behavior (there seems to be disagreement on
            O(n) vs O(n^2)).

        The users *should* know what they are getting into, but if the
        cost is worth it to them, they should be able to pay it
        without forking the project. If this veto causes a fork that's
        not good.

        On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov
        <msoko...@gmail.com> wrote:

            We do have a dataset built from Wikipedia in luceneutil.
            It comes in 100- and 300-dimensional varieties and can
            easily generate large numbers of vector documents from
            the articles data. To go higher we could concatenate
            vectors from that, and I believe the performance numbers
            would be plausible.
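
            Something like the following would do for generating the test
            vectors (an illustrative sketch; repeating the same embedding is
            fine for performance testing, though obviously not semantically
            meaningful):

                // Turn an existing 300-dim vector into e.g. a 1200-dim test vector.
                static float[] concat(float[] base, int copies) {
                  float[] out = new float[base.length * copies];
                  for (int i = 0; i < copies; i++) {
                    System.arraycopy(base, 0, out, i * base.length, base.length);
                  }
                  return out;
                }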

            On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss
            <dawid.we...@gmail.com> wrote:

                Can we set up a branch in which the limit is bumped to
                2048, then have
                a realistic, free data set (wikipedia sample or
                something) that has,
                say, 5 million docs and vectors created using public
                data (glove
                pre-trained embeddings or the like)? We then could run
                indexing on the
                same hardware with 512, 1024 and 2048 and see what the
                numbers, limits
                and behavior actually are.
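
                The core of such a harness could be as small as this (a
                sketch only: vectorsOfDimension() stands in for whatever
                GloVe/Wikipedia vector source we pick, the field and path
                names are placeholders, and the Lucene imports are omitted):

                    for (int dims : new int[] {512, 1024, 2048}) {
                      long start = System.nanoTime();
                      try (FSDirectory dir = FSDirectory.open(Paths.get("bench-" + dims));
                           IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
                        for (float[] v : vectorsOfDimension(dims)) { // hypothetical source
                          Document doc = new Document();
                          doc.add(new KnnFloatVectorField("vec", v,
                              VectorSimilarityFunction.EUCLIDEAN));
                          writer.addDocument(doc);
                        }
                      }
                      long seconds = (System.nanoTime() - start) / 1_000_000_000L;
                      System.out.println(dims + " dims: " + seconds + " s");
                    }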

                I can help in writing this but not until after Easter.


                Dawid

                On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand
                <jpou...@gmail.com> wrote:
                >
                > As Dawid pointed out earlier on this thread, this is
                the rule for
                > Apache projects: a single -1 vote on a code change
                is a veto and
                > cannot be overridden. Furthermore, Robert is one of
                the people on this
                > project who worked the most on debugging subtle
                bugs, making Lucene
                > more robust and improving our test framework, so I'm
                listening when he
                > voices quality concerns.
                >
                > The argument against removing/raising the limit that
                resonates with me
                > the most is that it is a one-way door. As MikeS
                highlighted earlier on
                > this thread, implementations may want to take
                advantage of the fact
                > that there is a limit at some point too. This is why
                I don't want to
                > remove the limit and would prefer a slight increase,
                such as 2048 as
                > suggested in the original issue, which would enable
                most of the things
                > that users who have been asking about raising the
                limit would like to
                > do.
                >
                > I agree that the merge-time memory usage and slow
                indexing rate are
                > not great. But it's still possible to index
                multi-million vector
                > datasets with a 4GB heap without hitting OOMEs
                regardless of the
                > number of dimensions, and the feedback I'm seeing is
                that many users
                > are still interested in indexing multi-million
                vector datasets despite
                > the slow indexing rate. I wish we could do better,
                and vector indexing
                > is certainly more expert than text indexing, but it
                still is usable in
                > my opinion. I understand how giving Lucene more
                information about
                > vectors prior to indexing (e.g. clustering
                information as Jim pointed
                > out) could help make merging faster and more
                memory-efficient, but I
                > would really like to avoid making it a requirement
                for indexing
                > vectors as it also makes this feature much harder to
                use.
                >
                > On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
                > <a.benede...@sease.io> wrote:
                > >
                > > I am very attentive to listening to opinions, but I am
                unconvinced here, and I am not sure that a single
                person's opinion should be allowed to be detrimental to
                such an important project.
                > >
                > > The limit as far as I know is literally just
                raising an exception.
                > > Removing it won't alter in any way the current
                performance for users in low dimensional space.
                > > Removing it will just enable more users to use Lucene.
                > >
                > > If new users in certain situations are unhappy
                with the performance, they may contribute improvements.
                > > This is how you make progress.
                > >
                > > If it's a reputation thing, trust me that not
                allowing users to play with high dimensional space
                will equally damage it.
                > >
                > > To me it's really a no brainer.
                > > Removing the limit and enabling people to use
                high-dimensional vectors will take minutes.
                > > Improving the hnsw implementation can take months.
                > > Pick one to begin with...
                > >
                > > And there's no-one paying me here, no company
                interest whatsoever, actually I pay people to
                contribute, I am just convinced it's a good idea.
                > >
                > >
                > > On Sat, 8 Apr 2023, 18:57 Robert Muir,
                <rcm...@gmail.com> wrote:
                > >>
                > >> I disagree with your categorization. I put in
                plenty of work and
                > >> experienced plenty of pain myself, writing tests
                and fighting these
                issues, after I saw that, two releases in a row,
                vector indexing fell
                > >> over and hit integer overflows etc on small datasets:
                > >>
                > >> https://github.com/apache/lucene/pull/11905
                > >>
                > >> Attacking me isn't helping the situation.
                > >>
                > >> PS: when I said the "one guy who wrote the code"
                I didn't mean it in
                > >> any kind of demeaning fashion really. I meant to
                describe the current
                > >> state of usability with respect to indexing a few
                million docs with
                > >> high dimensions. You can scroll up the thread and
                see that at least
                > >> one other committer on the project experienced
                similar pain as me.
                > >> Then, think about users who aren't committers
                trying to use the
                > >> functionality!
                > >>
                > >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov
                <msoko...@gmail.com> wrote:
                > >> >
                > >> > What you said about increasing dimensions
                requiring a bigger ram buffer on merge is wrong.
                That's the point I was trying to make. Your concerns
                about merge costs are not wrong, but your conclusion
                that we need to limit dimensions is not justified.
                > >> >
                > >> > You complain that HNSW sucks and doesn't scale,
                but when I show it scales linearly with dimension you
                just ignore that and complain about something entirely
                different.
                > >> >
                > >> > You demand that people run all kinds of tests
                to prove you wrong but when they do, you don't listen
                and you won't put in the work yourself or complain
                that it's too hard.
                > >> >
                > >> > Then you complain about people not meeting you
                half way. Wow
                > >> >
                > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir
                <rcm...@gmail.com> wrote:
                > >> >>
                > >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
                > >> >> <michael.wech...@wyona.com> wrote:
                > >> >> >
                > >> >> > What exactly do you consider reasonable?
                > >> >>
                > >> >> Let's begin a real discussion by being HONEST
                about the current
                > >> >> status. Please put political correctness or your
                own company's wishes
                > >> >> aside; we know it's not in a good state.
                > >> >>
                > >> >> The current status is that the one guy who wrote
                the code can set a
                > >> >> multi-gigabyte RAM buffer and index a small
                dataset with 1024
                > >> >> dimensions in HOURS (I didn't ask what hardware).
                > >> >>
                > >> >> My concern is everyone else except the one
                guy; I want it to be
                > >> >> usable. Increasing dimensions just means an even
                bigger multi-gigabyte
                > >> >> RAM buffer and a bigger heap to avoid OOM on merge.
                > >> >> It is also a permanent backwards compatibility
                decision, we have to
                > >> >> support it once we do this and we can't just
                say "oops" and flip it
                > >> >> back.
                > >> >>
                > >> >> It is unclear to me if the multi-gigabyte RAM
                buffer is really to
                > >> >> avoid merges because they are so slow and it
                would be DAYS otherwise,
                > >> >> or if it's to avoid merges so it doesn't hit OOM.
                > >> >> Also from personal experience, it takes trial
                and error (meaning
                > >> >> experiencing OOM on merge!!!) before you get
                those heap values correct
                > >> >> for your dataset. This usually means starting
                over, which is
                > >> >> frustrating and wastes more time.
                > >> >>
                > >> >> Jim mentioned some ideas about the memory
                usage in IndexWriter; seems
                > >> >> to me like it's a good idea. Maybe the
                multi-gigabyte RAM buffer can be
                > >> >> avoided in this way and performance improved
                by writing bigger
                > >> >> segments with Lucene's defaults. But this
                doesn't mean we can simply
                > >> >> ignore the horrors of what happens on merge.
                Merging needs to scale so
                > >> >> that indexing really scales.
                > >> >>
                > >> >> At least it shouldn't spike RAM on trivial data
                amounts and cause OOM,
                > >> >> and definitely it shouldn't burn hours and
                hours of CPU in O(n^2)
                > >> >> fashion when indexing.
                > >> >>
                > >> >>
                
                > >> >>
                > >>
                > >>
                
                > >>
                >
                >
                > --
                > Adrien
                >
                >
                
                >

                



        --
        http://www.needhamsoftware.com (work)
        http://www.the111shift.com (play)
