To proceed pragmatically, I opened a new thread: *Dimensions Limit for KNN vectors - Next Steps*. It is meant to address the main point of this discussion.
For the following points:

2) [medium task] We all want more benchmarks for Lucene vector-based search, with a good variety of vector dimensions and encodings
3) [big task?] Some people would like to improve vector-based search performance, because it is currently not acceptable; it's not clear when and how

Feel free to create the discussion threads for these if you believe they are an immediate priority.

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*
e-mail: a.benede...@sease.io
*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>

On Fri, 21 Apr 2023 at 08:30, Michael Wechner <michael.wech...@wyona.com> wrote:

> Yes, they are, but they should still help us test performance and scalability :-)
>
> On 21.04.23 at 09:24, Ishan Chattopadhyaya wrote:
>
> Seems like they were all 768 dimensions.
>
> On Fri, 21 Apr, 2023, 11:48 am Michael Wechner, <michael.wech...@wyona.com> wrote:
>
>> Hi Together
>>
>> Cohere just published approx. 100 million embeddings based on Wikipedia content:
>>
>> https://txt.cohere.com/embedding-archives-wikipedia/
>>
>> see
>>
>> https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
>> https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings
>> ...
>>
>> HTH
>>
>> Michael
>>
>> On 13.04.23 at 07:58, Michael Wechner wrote:
>>
>> Hi Kent
>>
>> Great, thank you very much!
>>
>> Will download it later today :-)
>>
>> All the best
>>
>> Michael
>>
>> On 13.04.23 at 01:35, Kent Fitch wrote:
>>
>> Hi Michael (and anyone else who wants just over 240K "real world" ada-002 vectors of dimension 1536),
>>
>> You are welcome to retrieve a tar.gz file which contains:
>> - 47K embeddings of Canberra Times news article text from 1994
>> - 38K embeddings of the first paragraphs of Wikipedia articles about organisations
>> - 156.6K embeddings of the first paragraphs of Wikipedia articles about people
>>
>> https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing
>>
>> The file is about 1.7GB and will expand to about 4.4GB. It will be accessible for at least a week, and I hope you don't hit any Google Drive download limits trying to retrieve it.
>>
>> The embeddings were generated using my OpenAI account, and you are welcome to use them for any purpose you like.
>>
>> best wishes,
>>
>> Kent Fitch
>>
>> On Wed, Apr 12, 2023 at 4:37 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>>
>>> Thank you very much for your feedback!
>>>
>>> In a previous post (April 7) you wrote you could make available the 47K ada-002 vectors, which would be great!
>>>
>>> Would it make sense to set up a public GitHub repo, so that others could use or also contribute vectors?
>>> Thanks
>>>
>>> Michael Wechner
>>>
>>> On 12.04.23 at 04:51, Kent Fitch wrote:
>>>
>>> I only know some characteristics of the OpenAI ada-002 vectors. They are very popular as embeddings/text characterisations, as they allow more accurate/"human meaningful" semantic search results with fewer dimensions than their predecessors. I've evaluated a few different embedding models, including some BERT variants, CLIP ViT-L-14 (with 768 dims, which was quite good), OpenAI's ada-001 (1024 dims) and babbage-001 (2048 dims), and ada-002 is qualitatively the best, although that will certainly change!
>>>
>>> In any case, ada-002 vectors have interesting characteristics that I think mean you could confidently create synthetic vectors which would be hard to distinguish from "real" vectors. I found this from looking at 47K ada-002 vectors generated across a full year (1994) of newspaper articles from the Canberra Times and 200K Wikipedia articles:
>>> - there is no discernible/significant correlation between values in any pair of dimensions
>>> - all but 5 of the 1536 dimensions have an almost identical distribution of values, shown as the central blob on these graphs (they plot only a few of the 1531 dimensions with clumped values plus the 5 "outlier" dimensions, but all 1531 non-outlier dims fall in that blob), which makes for some easy quantisation from float to byte if you don't want to go the full k-means/clustering/Lloyd's-algorithm approach:
>>> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>>> - the variance of the value of each dimension is characteristic:
>>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>>>
>>> This probably represents something significant about how the ada-002 embeddings are created, but I think it also means creating "realistic" values is possible. I did not use this information when testing recall and performance of Lucene's HNSW implementation on 192m documents. Instead, I slightly dithered the values of a "real" set of 47K docs, stored fields in each doc referencing the "base" document the dithers were made from, and used different dithering magnitudes so that I could test recall with different neighbour sizes ("M"), construction beam-widths and search beam-widths.
>>>
>>> best regards
>>>
>>> Kent Fitch
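A minimal sketch of the dithering Kent describes, assuming unit-length float[] embeddings; the class name and the uniform-noise model are illustrative assumptions, not Kent's actual code:

    import java.util.Random;

    // Perturb each component of a "real" embedding by uniform noise of a
    // chosen magnitude, then re-normalize to unit length so dot-product /
    // cosine comparisons stay meaningful. (Illustrative helper only.)
    public final class DitherVectors {
      static float[] dither(float[] base, float magnitude, Random rnd) {
        float[] out = new float[base.length];
        for (int i = 0; i < base.length; i++) {
          out[i] = base[i] + (rnd.nextFloat() * 2f - 1f) * magnitude;
        }
        return normalize(out);
      }

      static float[] normalize(float[] v) {
        double sum = 0;
        for (float x : v) sum += (double) x * x;
        float inv = (float) (1.0 / Math.sqrt(sum));
        for (int i = 0; i < v.length; i++) v[i] *= inv;
        return v;
      }
    }

Larger magnitudes produce synthetic vectors farther from the base document, which is what lets recall be probed at different neighbour sizes and beam-widths.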
>>> On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>>
>>>> I understand what you mean, namely that it seems artificial, but I don't understand why this matters for testing the performance and scalability of the indexing.
>>>>
>>>> Let's assume Lucene's limit were 4 instead of 1024, and there were only open-source models generating vectors with 4 dimensions, for example
>>>>
>>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>>
>>>> and now I concatenate them into vectors with 8 dimensions
>>>>
>>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>>
>>>> and normalize them to length 1.
>>>>
>>>> Why should this be any different from a model which acts as a black box generating vectors with 8 dimensions?
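The construction Michael describes is straightforward; a minimal sketch (the method name is illustrative):

    // Concatenate two low-dimensional vectors, then scale the result to
    // unit length so it looks like the output of a single 8-dim model.
    static float[] concatAndNormalize(float[] a, float[] b) {
      float[] out = new float[a.length + b.length];
      System.arraycopy(a, 0, out, 0, a.length);
      System.arraycopy(b, 0, out, a.length, b.length);
      double sum = 0;
      for (float x : out) sum += (double) x * x;
      float inv = (float) (1.0 / Math.sqrt(sum));
      for (int i = 0; i < out.length; i++) out[i] *= inv;
      return out;
    }

Applied to the first two 4-dim vectors above, this yields the first 8-dim vector (up to the final unit-length scaling).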
>>>> On 11.04.23 at 19:05, Michael Sokolov wrote:
>>>>
>>>> >> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>>>> >
>>>> > We have plenty of text; the problem is coming up with a realistic vector model that requires as many dimensions as people seem to be demanding. As I said above, after surveying huggingface I couldn't find any text-based model using more than 768 dimensions. So far we have some ideas of generating higher-dimensional data by dithering or concatenating existing data, but it seems artificial.
>>>> >
>>>> > On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>>> >
>>>> >> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>>>> >>
>>>> >> I am asking because I recently attended the following presentation by Anastassia Shaitarova (UZH Institute for Computational Linguistics, https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html):
>>>> >>
>>>> >> ----
>>>> >> Can we Identify Machine-Generated Text? An Overview of Current Approaches
>>>> >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>>>> >>
>>>> >> The detection of machine-generated text has become increasingly important due to the prevalence of automated content generation and its potential for misuse. In this talk, we will discuss the motivation for automatic detection of generated text. We will present the currently available methods, including feature-based classification as a "first line of defense." We will provide an overview of the detection tools that have been made available so far and discuss their limitations. Finally, we will reflect on some open problems associated with the automatic discrimination of generated texts.
>>>> >> ----
>>>> >>
>>>> >> Her conclusion was that it has become basically impossible to differentiate between text generated by humans and text generated by, for example, ChatGPT.
>>>> >>
>>>> >> Others have a slightly different opinion, though; see for example
>>>> >>
>>>> >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>>> >>
>>>> >> But I would argue that real-world and synthetic data have become close enough that testing the performance and scalability of indexing should be possible with synthetic data.
>>>> >>
>>>> >> I completely agree that we have to base our discussions and decisions on scientific methods, and that we have to make sure that Lucene performs and scales well and that we understand the limits and what is going on under the hood.
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> Michael W
>>>> >>
>>>> >> On 11.04.23 at 14:29, Michael McCandless wrote:
>>>> >>
>>>> >> +1 to test on real vector data -- if you test on synthetic data you draw synthetic conclusions.
>>>> >>
>>>> >> Can someone post the theoretical performance (CPU and RAM required) of HNSW construction? Do we know/believe our HNSW implementation has achieved that theoretical big-O performance? Maybe we have some silly performance bug that's causing it not to?
>>>> >>
>>>> >> As I understand it, HNSW makes the tradeoff of costly construction for faster searching, which is typically the right tradeoff for search use cases. We do this in other parts of the Lucene index too.
>>>> >>
>>>> >> Lucene will do a logarithmic number of merges over time, i.e. each doc will be merged O(log(N)) times in its lifetime in the index. We need to multiply that by the cost of re-building the whole HNSW graph on each merge. BTW, other things in Lucene, like BKD/dimensional points, also rebuild the whole data structure on each merge, I think? But, as Rob pointed out, stored-fields merging does indeed use some sneaky tricks to avoid excessive block decompress/recompress on each merge.
>>>> >>
>>>> >>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>>> >>
>>>> >> Actually I think Robert's veto stands on its technical merit already. Robert's take on technical matters very much resonates with me, even if he is sometimes prickly in how he expresses it ;)
>>>> >>
>>>> >> His point is that we, as a dev community, are not paying enough attention to the indexing performance of our KNN algo (HNSW) and implementation, and that it is reckless to increase / remove limits in that state. It is indeed a one-way-door decision, and one must confront such decisions with caution, especially for such widely used base infrastructure as Lucene. We don't even advertise today in our javadocs that you need XXX heap if you index vectors with dimension Y, fanout X, levels Z, etc.
>>>> >>
>>>> >> RAM used during merging is unaffected by dimensionality, but is affected by fanout, because the HNSW graph (not the raw vectors) is memory resident, I think? Maybe we could move it off-heap and let the OS manage the memory (and still document the RAM requirements)? Maybe merge RAM costs should be accounted for in IW's RAM buffer accounting? They are not today, and there are some other things that use non-trivial RAM, e.g. the doc mapping (to compress the docid space when deletions are reclaimed).
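A back-of-the-envelope sketch of that last point, in code. The level-0 fanout of 2*M follows the HNSW paper's convention, the ~5% upper-level share is an assumed geometric-decay overhead rather than a measured Lucene figure, and neighbour ids are assumed to be stored as ints:

    public final class HnswRamEstimate {
      // Heap held by the HNSW graph itself (neighbour lists, not raw
      // vectors): level 0 allows up to 2*M neighbours per node; upper
      // levels add an assumed ~5% on top.
      static long approxGraphHeapBytes(long numVectors, int m) {
        long level0 = numVectors * 2L * m * Integer.BYTES;
        long upper  = (long) (numVectors * 0.05 * m * Integer.BYTES);
        return level0 + upper;
      }

      public static void main(String[] args) {
        // ~1.3 GB for 10M vectors at M=16, independent of vector dimension.
        System.out.println(approxGraphHeapBytes(10_000_000L, 16));
      }
    }

Under these assumptions the graph's heap grows with fanout and document count but not with dimension, matching the description above.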
>>>> >> When we added KNN vector testing to Lucene's nightly benchmarks, the indexing time massively increased -- see annotations DH and DP here: https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of course, that is using a single thread for indexing (on a box that has 128 cores!) so that we produce a deterministic index every night ...
>>>> >>
>>>> >> Stepping out (meta) a bit ... this discussion is precisely one of the awesome benefits of the (informed) veto. It means risky changes to the software, as determined by any single informed developer on the project, can force a healthy discussion about the problem at hand. Robert is legitimately concerned about a real issue, and so we should use our creative energies to characterize our HNSW implementation's performance, document it clearly for users, and uncover ways to improve it.
>>>> >>
>>>> >> Mike McCandless
>>>> >>
>>>> >> http://blog.mikemccandless.com
>>>> >>
>>>> >> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>>> >>
>>>> >>> I think Gus's points are on target.
>>>> >>>
>>>> >>> I recommend we move this forward in this way: we stop any discussion, everyone interested proposes an option with a motivation, and then we aggregate the options and perhaps create a vote.
>>>> >>>
>>>> >>> I am also on the same page on the fact that a veto should come with clear and reasonable technical merit, which in my opinion has not yet been given.
>>>> >>>
>>>> >>> I also apologise if any of my words sounded harsh or like personal attacks; I never meant them that way.
>>>> >>>
>>>> >>> My proposed option:
>>>> >>>
>>>> >>> 1) remove the limit and potentially make it configurable.
>>>> >>> Motivation: the system administrator can enforce a limit that their users need to respect, in line with whatever the admin decides is acceptable. The default can stay the current one.
>>>> >>>
>>>> >>> That's my favourite at the moment, but I agree that this may need to change in the future, as we may optimise the data structures for certain dimensions. I am a big fan of YAGNI (you aren't going to need it), so I am OK with facing a different discussion if that happens.
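A purely illustrative sketch of what option 1) could look like; this is not an existing Lucene API, and the class and property name are hypothetical:

    // Hypothetical JVM-level override for the max vector dimension,
    // defaulting to the current hard-coded limit. An admin could raise it
    // with -Dlucene.hnsw.maxDimensions=2048 while the shipped default
    // stays unchanged.
    public final class VectorLimits {
      public static final int DEFAULT_MAX_DIMENSIONS = 1024;
      public static final int MAX_DIMENSIONS =
          Integer.getInteger("lucene.hnsw.maxDimensions", DEFAULT_MAX_DIMENSIONS);

      public static void checkDimension(int dim) {
        if (dim > MAX_DIMENSIONS) {
          throw new IllegalArgumentException(
              "vector dimension " + dim + " exceeds configured maximum " + MAX_DIMENSIONS);
        }
      }
    }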
>>>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com> wrote:
>>>> >>>
>>>> >>>> What I see so far:
>>>> >>>>
>>>> >>>> - Much positive support for raising the limit
>>>> >>>> - Slightly less support for removing it or making it configurable
>>>> >>>> - A single veto which argues that an (as yet undefined) performance standard must be met before raising the limit
>>>> >>>> - Hot tempers (various) making this discussion difficult
>>>> >>>>
>>>> >>>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>>> >>>>
>>>> >>>> - No standard for the performance is given, so it cannot be technically met. Without hard criteria it's a moving target.
>>>> >>>> - It appears to encode a valuation of the user's time, and that valuation is really up to the user. Some users may consider 2 hours useless and not worth it, and others might happily wait 2 hours. This is not a technical decision; it's a business decision regarding the relative value of the time invested vs the value of the result. If I can cure cancer by indexing for a year, that might be worth it... (hyperbole of course).
>>>> >>>>
>>>> >>>> Things I would consider to have technical merit that I don't hear:
>>>> >>>>
>>>> >>>> - Impact on the speed of *other* indexing operations (devaluation of other functionality)
>>>> >>>> - Actual scenarios that work when the limit is low and fail when the limit is high (new failure on the same data with the limit raised)
>>>> >>>>
>>>> >>>> One thing that might or might not have technical merit:
>>>> >>>>
>>>> >>>> - If someone feels there is a lack of documentation of the costs/performance implications of using large vectors, possibly including reproducible benchmarks establishing the scaling behavior (there seems to be disagreement on O(n) vs O(n^2))
>>>> >>>>
>>>> >>>> The users *should* know what they are getting into, but if the cost is worth it to them, they should be able to pay it without forking the project. If this veto causes a fork, that's not good.
>>>> >>>>
>>>> >>>> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <msoko...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100- and 300-dimensional varieties, and we can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that, and I believe the performance numbers would be plausible.
>>>> >>>>>
>>>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>>> Can we set up a branch in which the limit is bumped to 2048, and then have a realistic, free data set (a Wikipedia sample or something) that has, say, 5 million docs and vectors created using public data (GloVe pre-trained embeddings or the like)? We could then run indexing on the same hardware with 512, 1024 and 2048 dimensions and see what the numbers, limits and behavior actually are.
>>>> >>>>>>
>>>> >>>>>> I can help in writing this, but not until after Easter.
>>>> >>>>>>
>>>> >>>>>> Dawid
>>>> >>>>>>
>>>> >>>>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>>> As Dawid pointed out earlier on this thread, this is the rule for Apache projects: a single -1 vote on a code change is a veto and cannot be overridden. Furthermore, Robert is one of the people on this project who has worked the most on debugging subtle bugs, making Lucene more robust and improving our test framework, so I'm listening when he voices quality concerns.
>>>> >>>>>>>
>>>> >>>>>>> The argument against removing/raising the limit that resonates with me the most is that it is a one-way door. As MikeS highlighted earlier on this thread, implementations may want to take advantage of the fact that there is a limit at some point too. This is why I don't want to remove the limit and would prefer a slight increase, such as 2048 as suggested in the original issue, which would enable most of the things that users who have been asking about raising the limit would like to do.
>>>> >>>>>>>
>>>> >>>>>>> I agree that the merge-time memory usage and slow indexing rate are not great.
>>>> >>>>>>> But it's still possible to index multi-million-vector datasets with a 4GB heap without hitting OOMEs regardless of the number of dimensions, and the feedback I'm seeing is that many users are still interested in indexing multi-million-vector datasets despite the slow indexing rate. I wish we could do better, and vector indexing is certainly more expert than text indexing, but it is still usable in my opinion. I understand how giving Lucene more information about vectors prior to indexing (e.g. clustering information, as Jim pointed out) could help make merging faster and more memory-efficient, but I would really like to avoid making it a requirement for indexing vectors, as that also makes this feature much harder to use.
>>>> >>>>>>>
>>>> >>>>>>> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>>> >>>>>>>
>>>> >>>>>>>> I am very attentive to others' opinions, but I am unconvinced here, and I am not sure that a single person's opinion should be allowed to be detrimental to such an important project.
>>>> >>>>>>>>
>>>> >>>>>>>> The limit, as far as I know, is literally just raising an exception. Removing it won't alter in any way the current performance for users in low-dimensional space. Removing it will just enable more users to use Lucene.
>>>> >>>>>>>>
>>>> >>>>>>>> If new users are unhappy with the performance in certain situations, they may contribute improvements. This is how you make progress.
>>>> >>>>>>>>
>>>> >>>>>>>> If it's a reputation thing, trust me that not allowing users to play with high-dimensional space will damage it equally.
>>>> >>>>>>>>
>>>> >>>>>>>> To me it's really a no-brainer. Removing the limit and enabling people to use high-dimensional vectors will take minutes. Improving the HNSW implementation can take months. Pick one to begin with...
>>>> >>>>>>>>
>>>> >>>>>>>> And there's no one paying me here, no company interest whatsoever; actually, I pay people to contribute. I am just convinced it's a good idea.
>>>> >>>>>>>>
>>>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>> I disagree with your categorization. I put in plenty of work and experienced plenty of pain myself, writing tests and fighting these issues, after I saw that, two releases in a row, vector indexing fell over and hit integer overflows etc. on small datasets:
>>>> >>>>>>>>>
>>>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>>>> >>>>>>>>>
>>>> >>>>>>>>> Attacking me isn't helping the situation.
>>>> >>>>>>>>>
>>>> >>>>>>>>> PS: when I said "the one guy who wrote the code" I didn't mean it in any kind of demeaning fashion, really. I meant to describe the current state of usability with respect to indexing a few million docs with high dimensions. You can scroll up the thread and see that at least one other committer on the project experienced similar pain to mine. Then think about users who aren't committers trying to use the functionality!
>>>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>> What you said about increasing dimensions requiring a bigger RAM buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> You complain that HNSW sucks and doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> You demand that people run all kinds of tests to prove you wrong, but when they do, you don't listen, and you won't put in the work yourself or you complain that it's too hard.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Then you complain about people not meeting you halfway. Wow.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>> What exactly do you consider reasonable?
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Let's begin a real discussion by being HONEST about the current status. Please put politically-correct wishes, or your own company's wishes, aside; we know it's not in a good state.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> The current status is that the one guy who wrote the code can set a multi-gigabyte RAM buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware).
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> My concern is everyone else except the one guy; I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte RAM buffer and a bigger heap to avoid OOM on merge. It is also a permanent backwards-compatibility decision: we have to support it once we do this, and we can't just say "oops" and flip it back.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> It is unclear to me whether the multi-gigabyte RAM buffer is really to avoid merges because they are so slow and it would be DAYS otherwise, or whether it's to avoid merges so it doesn't hit OOM. Also, from personal experience, it takes trial and error (meaning experiencing OOM on merge!!!) before you get those heap values correct for your dataset. This usually means starting over, which is frustrating and wastes more time.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in IndexWriter; that seems like a good idea to me. Maybe the multi-gigabyte RAM buffer can be avoided in this way, and performance improved, by writing bigger segments with Lucene's defaults. But this doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> At the least, it shouldn't spike RAM on trivial data amounts and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
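For concreteness, a sketch of the setup being debated, assuming a recent Lucene 9.x API (KnnFloatVectorField; earlier 9.x releases call it KnnVectorField). The 4096 MB buffer stands in for the "multi-gigabyte RAM buffer" in question and is an illustration, not a recommendation:

    import java.nio.file.Paths;
    import java.util.Arrays;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class VectorIndexSketch {
      public static void main(String[] args) throws Exception {
        // Large RAM buffer so vector segments flush big and merge rarely.
        IndexWriterConfig iwc = new IndexWriterConfig().setRAMBufferSizeMB(4096);
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("vector-index")), iwc)) {
          float[] v = new float[1024];
          Arrays.fill(v, 1f / 32f); // unit length: 1024 * (1/32)^2 == 1
          Document doc = new Document();
          doc.add(new KnnFloatVectorField("vec", v, VectorSimilarityFunction.DOT_PRODUCT));
          writer.addDocument(doc);
        }
      }
    }

The tradeoff Robert describes lives in that one setRAMBufferSizeMB call: a bigger buffer defers graph rebuilds to fewer, larger merges, but raises the heap needed to survive them.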