Re: [Proposal] Remove max number of dimensions for KNN vectors

Michael Sokolov Wed, 12 Apr 2023 05:53:58 -0700

Just addressing [1] I believe there is a simple workaround.  Here's a
unit test demonstrating:


  public void testExcessivelyLargeVector() throws Exception {
    // Anonymous FieldType that reports a 2048-dimension FLOAT32 vector by
    // overriding the getters directly.
    IndexableFieldType vector2048 = new FieldType() {
      @Override
      public int vectorDimension() {
        return 2048;
      }

      @Override
      public VectorEncoding vectorEncoding() {
        return VectorEncoding.FLOAT32;
      }

      @Override
      public VectorSimilarityFunction vectorSimilarityFunction() {
        return VectorSimilarityFunction.EUCLIDEAN;
      }
    };
    // `codec` is assumed to be provided by the enclosing test class.
    try (Directory dir = newDirectory();
         IndexWriter iw =
             new IndexWriter(dir, newIndexWriterConfig(null).setCodec(codec))) {
      Document doc = new Document();
      // Copy the overridden attributes into a concrete FieldType.
      FieldType type = new FieldType(vector2048);
      doc.add(new KnnVectorField("vector2048", new float[2048], type));
      iw.addDocument(doc);
    }
  }

On Wed, Apr 12, 2023 at 8:10 AM Alessandro Benedetti
<[email protected]> wrote:
>
> My attempt to list here only a set of proposals to then vote on has 
> unfortunately failed.
>
> I appreciate the discussion on better benchmarking of HNSW, but my feeling is 
> that it is orthogonal to the limit discussion itself. Should we create a 
> separate mail thread or GitHub/Jira issue for that?
>
> At the moment I see at least three lines of activities as an outcome from 
> this (maybe too long) discussion:
>
> 1) [small task] a good number of people need the max limit increased or 
> removed, as an enabler to bring more users to Lucene and ease adoption for 
> Lucene-based systems (Apache Solr, Elasticsearch, OpenSearch)
>
> 2) [medium task] we all want more benchmarks for Lucene vector-based search, 
> with a good variety of vector dimensions and encodings
>
> 3) [big task?] some people would like to improve vector-based search 
> performance because it is currently not acceptable; it's not clear when and how
>
> A question I have for point 1, does it really need to be a one way door?
> Can't we reduce the max limit in the future if the implementation becomes 
> coupled with certain dimension sizes?
> It's not ideal I agree, but is back-compatibility more important than 
> pragmatic benefits?
> I.e.:
> Right now there's no implementation coupled with the max limit -> we 
> remove/increase the limit and get more users.
>
> With Lucene X.Y a clever committer introduces a super nice implementation 
> improvement that unfortunately limits the max size to K.
> Can't we just document it as a breaking change for that release? So at that 
> point we won't support vectors with more than K dimensions, but for a reason?
>
> Do we have similar precedents in Lucene?
>
>
>
> On Wed, 12 Apr 2023, 08:36 Michael Wechner, <[email protected]> wrote:
>>
>> thank you very much for your feedback!
>>
>> In a previous post (April 7) you wrote you could make available the 47K 
>> ada-002 vectors, which would be great!
>>
>> Would it make sense to set up a public GitHub repo, such that others could use 
>> or also contribute vectors?
>>
>> Thanks
>>
>> Michael Wechner
>>
>>
>> On 12.04.23 at 04:51, Kent Fitch wrote:
>>
>> I only know some characteristics of the OpenAI ada-002 vectors, although 
>> they are very popular as embeddings/text-characterisations as they allow 
>> more accurate/"human meaningful" semantic search results with fewer 
>> dimensions than their predecessors - I've evaluated a few different 
>> embedding models, including some BERT variants, CLIP ViT-L-14 (with 768 
>> dims, which was quite good), OpenAI's ada-001 (1024 dims) and babbage-001 
>> (2048 dims), and ada-002 is qualitatively the best, although that will 
>> certainly change!
>>
>> In any case, ada-002 vectors have interesting characteristics that I think 
>> mean you could confidently create synthetic vectors which would be hard to 
>> distinguish from "real" vectors.  I found this from looking at 47K ada-002 
>> vectors generated across a full year (1994) of newspaper articles from the 
>> Canberra Times and 200K wikipedia articles:
>> - there is no discernible/significant correlation between values in any pair 
>> of dimensions
>> - all but 5 of the 1536 dimensions have an almost identical distribution of 
>> values shown in the central blob on these graphs (that just show a few of 
>> these 1531 dimensions with clumped values and the 5 "outlier" dimensions, 
>> but all 1531 non-outlier dims are in there, which makes for some easy 
>> quantisation from float to byte if you don't want to go the full 
>> kmeans/clustering/Lloyd's-algorithm approach):
>> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>> - the variance of the value of each dimension is characteristic:
>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
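The "easy quantisation from float to byte" mentioned above could look roughly like the following sketch. It assumes a fixed clipping range [min, max] chosen from the observed value distribution and maps each float linearly onto the signed byte range. The class `ScalarQuantizer`, its method names, and any bounds passed to it are hypothetical illustrations, not Lucene API and not measured ada-002 values.

```java
/** Hypothetical linear float-to-byte quantizer; not part of Lucene. */
final class ScalarQuantizer {
  private final float min;
  private final float max;

  /** min/max are an assumed clipping range taken from the value distribution. */
  ScalarQuantizer(float min, float max) {
    if (!(min < max)) {
      throw new IllegalArgumentException("min must be smaller than max");
    }
    this.min = min;
    this.max = max;
  }

  /** Clips v to [min, max] and maps it linearly onto -128..127. */
  byte quantize(float v) {
    float clipped = Math.max(min, Math.min(max, v));
    float unit = (clipped - min) / (max - min); // 0.0 .. 1.0
    return (byte) (Math.round(unit * 255f) - 128);
  }

  /** Quantizes every component of a vector. */
  byte[] quantize(float[] vector) {
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      out[i] = quantize(vector[i]);
    }
    return out;
  }

  /** Approximate inverse of quantize(float); loses at most half a step. */
  float dequantize(byte b) {
    return min + ((b + 128) / 255f) * (max - min);
  }
}
```

Outlier dimensions whose values fall well outside the common range would be clipped by this scheme, so they would probably need their own range or a separate treatment.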
>>
>> This probably represents something significant about how the ada-002 
>> embeddings are created, but I think it also means creating "realistic" 
>> values is possible.  I did not use this information when testing recall & 
>> performance on Lucene's HNSW implementation on 192M documents, as I slightly 
>> dithered the values of a "real" set of 47K docs and stored other fields in 
>> the doc that referenced the "base" document that the dithers were made from, 
>> and used different dithering magnitudes so that I could test recall with 
>> different neighbour sizes ("M"), construction-beamwidth and 
>> search-beamwidths.
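A minimal sketch of the dithering idea, assuming uniform noise of a chosen magnitude is added independently to each component. The message does not say which noise distribution was used or whether the result was re-normalized, so both choices here are assumptions, and `VectorDither` is a hypothetical helper name.

```java
import java.util.Random;

/** Hypothetical helper that derives a "dithered" variant of a base vector. */
final class VectorDither {
  /**
   * Returns a copy of base with uniform noise in [-magnitude, +magnitude)
   * added to every component. The base vector is left unmodified.
   */
  static float[] dither(float[] base, float magnitude, Random rnd) {
    float[] out = new float[base.length];
    for (int i = 0; i < base.length; i++) {
      float noise = (rnd.nextFloat() * 2f - 1f) * magnitude;
      out[i] = base[i] + noise;
    }
    return out;
  }
}
```

Generating several variants per base vector with different magnitudes, and storing the base document id alongside each variant, gives a known set of "true" neighbours against which recall can be checked.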
>>
>> best regards
>>
>> Kent Fitch
>>
>>
>>
>>
>> On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner <[email protected]> 
>> wrote:
>>>
>>> I understand what you mean that it seems to be artificial, but I don't
>>> understand why this matters to test performance and scalability of the
>>> indexing?
>>>
>>> Let's assume the limit of Lucene would be 4 instead of 1024 and there
>>> are only open source models generating vectors with 4 dimensions, for
>>> example
>>>
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>>
>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>>
>>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>
>>> and now I concatenate them to vectors with 8 dimensions
>>>
>>>
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>
>>> and normalize them to length 1.
>>>
>>> Why should this be any different to a model which is acting like a black
>>> box generating vectors with 8 dimensions?
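The concatenate-then-normalize step described above can be sketched as follows. `VectorConcat` and its method names are hypothetical; the sample values are the first two 4-dimensional vectors quoted in the message.

```java
/** Hypothetical helper for building higher-dimensional test vectors. */
final class VectorConcat {
  /** Returns a new vector holding all components of a followed by those of b. */
  static float[] concat(float[] a, float[] b) {
    float[] out = new float[a.length + b.length];
    System.arraycopy(a, 0, out, 0, a.length);
    System.arraycopy(b, 0, out, a.length, b.length);
    return out;
  }

  /** Scales v to Euclidean length 1; rejects the zero vector. */
  static float[] l2Normalize(float[] v) {
    double sumOfSquares = 0;
    for (float x : v) {
      sumOfSquares += (double) x * x;
    }
    double norm = Math.sqrt(sumOfSquares);
    if (norm == 0) {
      throw new IllegalArgumentException("cannot normalize a zero vector");
    }
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / norm);
    }
    return out;
  }

  public static void main(String[] args) {
    float[] a = {0.02150459587574005f, 0.11223817616701126f,
                 -0.007903356105089188f, 0.03795722872018814f};
    float[] b = {0.026009393855929375f, 0.006306684575974941f,
                 0.020492585375905037f, -0.029064252972602844f};
    float[] unit = l2Normalize(concat(a, b));
    System.out.println("dimensions: " + unit.length);
  }
}
```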
>>>
>>>
>>>
>>>
>>> On 11.04.23 at 19:05, Michael Sokolov wrote:
>>> >> What exactly do you consider real vector data? Vector data which is 
>>> >> based on texts written by humans?
>>> > We have plenty of text; the problem is coming up with a realistic
>>> > vector model that requires as many dimensions as people seem to be
>>> > demanding. As I said above, after surveying huggingface I couldn't
>>> > find any text-based model using more than 768 dimensions. So far we
>>> > have some ideas of generating higher-dimensional data by dithering or
>>> > concatenating existing data, but it seems artificial.
>>> >
>>> > On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner
>>> > <[email protected]> wrote:
>>> >> What exactly do you consider real vector data? Vector data which is 
>>> >> based on texts written by humans?
>>> >>
>>> >> I am asking, because I recently attended the following presentation by 
>>> >> Anastassia Shaitarova (UZH Institute for Computational Linguistics, 
>>> >> https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>>> >>
>>> >> ----
>>> >>
>>> >> Can we Identify Machine-Generated Text? An Overview of Current Approaches
>>> >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>>> >>
>>> >> The detection of machine-generated text has become increasingly 
>>> >> important due to the prevalence of automated content generation and its 
>>> >> potential for misuse. In this talk, we will discuss the motivation for 
>>> >> automatic detection of generated text. We will present the currently 
>>> >> available methods, including feature-based classification as a “first 
>>> >> line-of-defense.” We will provide an overview of the detection tools 
>>> >> that have been made available so far and discuss their limitations. 
>>> >> Finally, we will reflect on some open problems associated with the 
>>> >> automatic discrimination of generated texts.
>>> >>
>>> >> ----
>>> >>
>>> >> and her conclusion was that it has become basically impossible to 
>>> >> differentiate between text generated by humans and text generated by for 
>>> >> example ChatGPT.
>>> >>
>>> >> Whereas others have a slightly different opinion, see for example
>>> >>
>>> >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>> >>
>>> >> But I would argue that real world and synthetic have become close enough 
>>> >> that testing performance and scalability of indexing should be possible 
>>> >> with synthetic data.
>>> >>
>>> >> I completely agree that we have to base our discussions and decisions on 
>>> >> scientific methods and that we have to make sure that Lucene performs 
>>> >> and scales well and that we understand the limits and what is going on 
>>> >> under the hood.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Michael W
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On 11.04.23 at 14:29, Michael McCandless wrote:
>>> >>
>>> >> +1 to test on real vector data -- if you test on synthetic data you draw 
>>> >> synthetic conclusions.
>>> >>
>>> >> Can someone post the theoretical performance (CPU and RAM required) of 
>>> >> HNSW construction?  Do we know/believe our HNSW implementation has 
>>> >> achieved that theoretical big-O performance?  Maybe we have some silly 
>>> >> performance bug that's causing it not to?
>>> >>
>>> >> As I understand it, HNSW makes the tradeoff of costly construction for 
>>> >> faster searching, which is typically the right tradeoff for search use 
>>> >> cases.  We do this in other parts of the Lucene index too.
>>> >>
>>> >> Lucene will do a logarithmic number of merges over time, i.e. each doc 
>>> >> will be merged O(log(N)) times in its lifetime in the index.  We need to 
>>> >> multiply that by the cost of re-building the whole HNSW graph on each 
>>> >> merge.  BTW, other things in Lucene, like BKD/dimensional points, also 
>>> >> rebuild the whole data structure on each merge, I think?  But, as Rob 
>>> >> pointed out, stored fields merging does indeed do some sneaky tricks to 
>>> >> avoid excessive block decompress/recompress on each merge.
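As a back-of-the-envelope illustration of the "O(log(N)) merges times a full graph rebuild" argument, the sketch below counts how often each vector would be re-inserted into an HNSW graph under a deliberately simplified model: every flush produces a segment of `docsPerFlush` documents, segments are merged `mergeFactor` at a time, and every merge rebuilds the graph from scratch. These are assumptions for illustration only; Lucene's real merge policy and HNSW merging are more nuanced, and `MergeCostEstimate` is a hypothetical name.

```java
/** Hypothetical cost model; not Lucene's actual merge policy. */
final class MergeCostEstimate {
  /** Number of merge levels a document passes through until one segment holds totalDocs. */
  static int mergeLevels(long totalDocs, long docsPerFlush, int mergeFactor) {
    if (docsPerFlush < 1 || mergeFactor < 2) {
      throw new IllegalArgumentException("need docsPerFlush >= 1 and mergeFactor >= 2");
    }
    int levels = 0;
    long segmentSize = docsPerFlush;
    while (segmentSize < totalDocs) {
      segmentSize *= mergeFactor;
      levels++;
    }
    return levels;
  }

  /** Graph insertions overall: one at flush time plus one per merge level, per document. */
  static long totalGraphInsertions(long totalDocs, long docsPerFlush, int mergeFactor) {
    return totalDocs * (1L + mergeLevels(totalDocs, docsPerFlush, mergeFactor));
  }

  public static void main(String[] args) {
    // Hypothetical numbers: 5M docs, 10K docs per flushed segment, merge factor 10.
    System.out.println(mergeLevels(5_000_000L, 10_000L, 10));          // prints 3
    System.out.println(totalGraphInsertions(5_000_000L, 10_000L, 10)); // prints 20000000
  }
}
```

Under this model the total work is the per-insertion cost multiplied by roughly N * (1 + log(N)), which is why the cost of a single graph insertion matters so much for overall indexing time.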
>>> >>
>>> >>> As I understand it, vetoes must have technical merit. I'm not sure that 
>>> >>> this veto rises to "technical merit" on 2 counts:
>>> >> Actually I think Robert's veto stands on its technical merit already.  
>>> >> Robert's take on technical matters very much resonates with me, even if 
>>> >> he is sometimes prickly in how he expresses them ;)
>>> >>
>>> >> His point is that we, as a dev community, are not paying enough 
>>> >> attention to the indexing performance of our KNN algo (HNSW) and 
>>> >> implementation, and that it is reckless to increase / remove limits in 
>>> >> that state.  It is indeed a one-way door decision and one must confront 
>>> >> such decisions with caution, especially for such a widely used base 
>>> >> infrastructure as Lucene.  We don't even advertise today in our javadocs 
>>> >> that you need XXX heap if you index vectors with dimension Y, fanout X, 
>>> >> levels Z, etc.
>>> >>
>>> >> RAM used during merging is unaffected by dimensionality, but is affected 
>>> >> by fanout, because the HNSW graph (not the raw vectors) is memory 
>>> >> resident, I think?  Maybe we could move it off-heap and let the OS 
>>> >> manage the memory (and still document the RAM requirements)?  Maybe 
>>> >> merge RAM costs should be accounted for in IW's RAM buffer accounting?  
>>> >> It is not today, and there are some other things that use non-trivial 
>>> >> RAM, e.g. the doc mapping (to compress docid space when deletions are 
>>> >> reclaimed).
>>> >>
>>> >> When we added KNN vector testing to Lucene's nightly benchmarks, the 
>>> >> indexing time massively increased -- see annotations DH and DP here: 
>>> >> https://home.apache.org/~mikemccand/lucenebench/indexing.html.  Nightly 
>>> >> benchmarks now start at 6 PM and don't finish until ~14.5 hours later.  
>>> >> Of course, that is using a single thread for indexing (on a box that has 
>>> >> 128 cores!) so we produce a deterministic index every night ...
>>> >>
>>> >> Stepping out (meta) a bit ... this discussion is precisely one of the 
>>> >> awesome benefits of the (informed) veto.  It means risky changes to the 
>>> >> software, as determined by any single informed developer on the project, 
>>> >> can force a healthy discussion about the problem at hand.  Robert is 
>>> >> legitimately concerned about a real issue and so we should use our 
>>> >> creative energies to characterize our HNSW implementation's performance, 
>>> >> document it clearly for users, and uncover ways to improve it.
>>> >>
>>> >> Mike McCandless
>>> >>
>>> >> http://blog.mikemccandless.com
>>> >>
>>> >>
>>> >> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti 
>>> >> <[email protected]> wrote:
>>> >>> I think Gus's points are on target.
>>> >>>
>>> >>> I recommend we move this forward in this way:
>>> >>> We stop any discussion and everyone interested proposes an option with 
>>> >>> a motivation, then we aggregate the options and we create a Vote maybe?
>>> >>>
>>> >>> I am also on the same page on the fact that a veto should come with a 
>>> >>> clear and reasonable technical merit, which also in my opinion has not 
>>> >>> come yet.
>>> >>>
>>> >>> I also apologise if any of my words sounded harsh or like personal 
>>> >>> attacks; I never meant them that way.
>>> >>>
>>> >>> My proposed option:
>>> >>>
>>> >>> 1) remove the limit and potentially make it configurable,
>>> >>> Motivation:
>>> >>> The system administrator can enforce a limit that its users need to 
>>> >>> respect, in line with whatever the admin has decided is acceptable for 
>>> >>> them.
>>> >>> Default can stay the current one.
>>> >>>
>>> >>> That's my favourite at the moment, but I agree that potentially in the 
>>> >>> future this may need to change, as we may optimise the data structures 
>>> >>> for certain dimensions. I am a big fan of YAGNI (you aren't going to 
>>> >>> need it), so I am OK with facing a different discussion if that happens 
>>> >>> in the future.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <[email protected]> wrote:
>>> >>>> What I see so far:
>>> >>>>
>>> >>>> Much positive support for raising the limit
>>> >>>> Slightly less support for removing it or making it configurable
>>> >>>> A single veto which argues that a (as yet undefined) performance 
>>> >>>> standard must be met before raising the limit
>>> >>>> Hot tempers (various) making this discussion difficult
>>> >>>>
>>> >>>> As I understand it, vetoes must have technical merit. I'm not sure 
>>> >>>> that this veto rises to "technical merit" on 2 counts:
>>> >>>>
>>> >>>> No standard for the performance is given so it cannot be technically 
>>> >>>> met. Without hard criteria it's a moving target.
>>> >>>> It appears to encode a valuation of the user's time, and that 
>>> >>>> valuation is really up to the user. Some users may consider 2 hours 
>>> >>>> useless and not worth it, and others might happily wait 2 hours. This 
>>> >>>> is not a technical decision, it's a business decision regarding the 
>>> >>>> relative value of the time invested vs the value of the result. If I 
>>> >>>> can cure cancer by indexing for a year, that might be worth it... 
>>> >>>> (hyperbole of course).
>>> >>>>
>>> >>>> Things I would consider to have technical merit that I don't hear:
>>> >>>>
>>> >>>> Impact on the speed of **other** indexing operations. (devaluation of 
>>> >>>> other functionality)
>>> >>>> Actual scenarios that work when the limit is low and fail when the 
>>> >>>> limit is high (new failure on the same data with the limit raised).
>>> >>>>
>>> >>>> One thing that might or might not have technical merit
>>> >>>>
>>> >>>> If someone feels there is a lack of documentation of the 
>>> >>>> costs/performance implications of using large vectors, possibly 
>>> >>>> including reproducible benchmarks establishing the scaling behavior 
>>> >>>> (there seems to be disagreement on O(n) vs O(n^2)).
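One way a reproducible benchmark could help settle the O(n) vs O(n^2) disagreement is to measure indexing time at several document counts and fit the slope of log(time) against log(n): a slope near 1 suggests linear scaling, a slope near 2 quadratic. The sketch below does an ordinary least-squares fit. `ScalingExponent` is a hypothetical helper, and any timings fed to it would have to come from real measurements with distinct document counts.

```java
/** Hypothetical helper: estimates k in time ~ c * n^k from measured timings. */
final class ScalingExponent {
  /** Least-squares slope of log(seconds) against log(n). */
  static double estimate(double[] n, double[] seconds) {
    if (n.length != seconds.length || n.length < 2) {
      throw new IllegalArgumentException("need at least two (n, seconds) pairs");
    }
    double sumX = 0, sumY = 0, sumXX = 0, sumXY = 0;
    for (int i = 0; i < n.length; i++) {
      double x = Math.log(n[i]);
      double y = Math.log(seconds[i]);
      sumX += x;
      sumY += y;
      sumXX += x * x;
      sumXY += x * y;
    }
    int m = n.length;
    return (m * sumXY - sumX * sumY) / (m * sumXX - sumX * sumX);
  }
}
```

Because merging adds a logarithmic factor, a measured slope slightly above 1 would still be consistent with near-linear per-document cost.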
>>> >>>>
>>> >>>> The users *should* know what they are getting into, but if the cost is 
>>> >>>> worth it to them, they should be able to pay it without forking the 
>>> >>>> project. If this veto causes a fork that's not good.
>>> >>>>
>>> >>>> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <[email protected]> 
>>> >>>> wrote:
>>> >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 
>>> >>>>> 100 and 300 dimensional varieties and can easily enough generate 
>>> >>>>> large numbers of vector documents from the articles data. To go 
>>> >>>>> higher we could concatenate vectors from that and I believe the 
>>> >>>>> performance numbers would be plausible.
>>> >>>>>
>>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <[email protected]> 
>>> >>>>> wrote:
>>> >>>>>> Can we set up a branch in which the limit is bumped to 2048, then 
>>> >>>>>> have
>>> >>>>>> a realistic, free data set (wikipedia sample or something) that has,
>>> >>>>>> say, 5 million docs and vectors created using public data (glove
>>> >>>>>> pre-trained embeddings or the like)? We then could run indexing on 
>>> >>>>>> the
>>> >>>>>> same hardware with 512, 1024 and 2048 and see what the numbers, 
>>> >>>>>> limits
>>> >>>>>> and behavior actually are.
>>> >>>>>>
>>> >>>>>> I can help in writing this but not until after Easter.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Dawid
>>> >>>>>>
>>> >>>>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <[email protected]> 
>>> >>>>>> wrote:
>>> >>>>>>> As Dawid pointed out earlier on this thread, this is the rule for
>>> >>>>>>> Apache projects: a single -1 vote on a code change is a veto and
>>> >>>>>>> cannot be overridden. Furthermore, Robert is one of the people on 
>>> >>>>>>> this
>>> >>>>>>> project who worked the most on debugging subtle bugs, making Lucene
>>> >>>>>>> more robust and improving our test framework, so I'm listening when 
>>> >>>>>>> he
>>> >>>>>>> voices quality concerns.
>>> >>>>>>>
>>> >>>>>>> The argument against removing/raising the limit that resonates with 
>>> >>>>>>> me
>>> >>>>>>> the most is that it is a one-way door. As MikeS highlighted earlier 
>>> >>>>>>> on
>>> >>>>>>> this thread, implementations may want to take advantage of the fact
>>> >>>>>>> that there is a limit at some point too. This is why I don't want to
>>> >>>>>>> remove the limit and would prefer a slight increase, such as 2048 as
>>> >>>>>>> suggested in the original issue, which would enable most of the 
>>> >>>>>>> things
>>> >>>>>>> that users who have been asking about raising the limit would like 
>>> >>>>>>> to
>>> >>>>>>> do.
>>> >>>>>>>
>>> >>>>>>> I agree that the merge-time memory usage and slow indexing rate are
>>> >>>>>>> not great. But it's still possible to index multi-million vector
>>> >>>>>>> datasets with a 4GB heap without hitting OOMEs regardless of the
>>> >>>>>>> number of dimensions, and the feedback I'm seeing is that many users
>>> >>>>>>> are still interested in indexing multi-million vector datasets 
>>> >>>>>>> despite
>>> >>>>>>> the slow indexing rate. I wish we could do better, and vector 
>>> >>>>>>> indexing
>>> >>>>>>> is certainly more expert than text indexing, but it still is usable 
>>> >>>>>>> in
>>> >>>>>>> my opinion. I understand how giving Lucene more information about
>>> >>>>>>> vectors prior to indexing (e.g. clustering information as Jim 
>>> >>>>>>> pointed
>>> >>>>>>> out) could help make merging faster and more memory-efficient, but I
>>> >>>>>>> would really like to avoid making it a requirement for indexing
>>> >>>>>>> vectors as it also makes this feature much harder to use.
>>> >>>>>>>
>>> >>>>>>> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
>>> >>>>>>> <[email protected]> wrote:
>>> >>>>>>>> I listen attentively to opinions, but I am unconvinced here 
>>> >>>>>>>> and I am not sure that a single person's opinion should be allowed 
>>> >>>>>>>> to be detrimental to such an important project.
>>> >>>>>>>>
>>> >>>>>>>> The limit as far as I know is literally just raising an exception.
>>> >>>>>>>> Removing it won't alter in any way the current performance for 
>>> >>>>>>>> users in low dimensional space.
>>> >>>>>>>> Removing it will just enable more users to use Lucene.
>>> >>>>>>>>
>>> >>>>>>>> If new users in certain situations will be unhappy with the 
>>> >>>>>>>> performance, they may contribute improvements.
>>> >>>>>>>> This is how you make progress.
>>> >>>>>>>>
>>> >>>>>>>> If it's a reputation thing, trust me that not allowing users to 
>>> >>>>>>>> play with high dimensional space will equally damage it.
>>> >>>>>>>>
>>> >>>>>>>> To me it's really a no brainer.
>>> >>>>>>>> Removing the limit and enabling people to use high dimensional 
>>> >>>>>>>> vectors will take minutes.
>>> >>>>>>>> Improving the hnsw implementation can take months.
>>> >>>>>>>> Pick one to begin with...
>>> >>>>>>>>
>>> >>>>>>>> And there's no-one paying me here, no company interest whatsoever, 
>>> >>>>>>>> actually I pay people to contribute, I am just convinced it's a 
>>> >>>>>>>> good idea.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <[email protected]> wrote:
>>> >>>>>>>>> I disagree with your categorization. I put in plenty of work and
>>> >>>>>>>>> experienced plenty of pain myself, writing tests and fighting 
>>> >>>>>>>>> these
>>> >>>>>>>>> issues, after i saw that, two releases in a row, vector indexing 
>>> >>>>>>>>> fell
>>> >>>>>>>>> over and hit integer overflows etc on small datasets:
>>> >>>>>>>>>
>>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>>> >>>>>>>>>
>>> >>>>>>>>> Attacking me isn't helping the situation.
>>> >>>>>>>>>
>>> >>>>>>>>> PS: when i said the "one guy who wrote the code" I didn't mean it 
>>> >>>>>>>>> in
>>> >>>>>>>>> any kind of demeaning fashion really. I meant to describe the 
>>> >>>>>>>>> current
>>> >>>>>>>>> state of usability with respect to indexing a few million docs 
>>> >>>>>>>>> with
>>> >>>>>>>>> high dimensions. You can scroll up the thread and see that at 
>>> >>>>>>>>> least
>>> >>>>>>>>> one other committer on the project experienced similar pain as me.
>>> >>>>>>>>> Then, think about users who aren't committers trying to use the
>>> >>>>>>>>> functionality!
>>> >>>>>>>>>
>>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov 
>>> >>>>>>>>> <[email protected]> wrote:
>>> >>>>>>>>>> What you said about increasing dimensions requiring a bigger ram 
>>> >>>>>>>>>> buffer on merge is wrong. That's the point I was trying to make. 
>>> >>>>>>>>>> Your concerns about merge costs are not wrong, but your 
>>> >>>>>>>>>> conclusion that we need to limit dimensions is not justified.
>>> >>>>>>>>>>
>>> >>>>>>>>>> You complain that hnsw sucks and doesn't scale, but when I show 
>>> >>>>>>>>>> it scales linearly with dimension you just ignore that and 
>>> >>>>>>>>>> complain about something entirely different.
>>> >>>>>>>>>>
>>> >>>>>>>>>> You demand that people run all kinds of tests to prove you wrong 
>>> >>>>>>>>>> but when they do, you don't listen and you won't put in the work 
>>> >>>>>>>>>> yourself or complain that it's too hard.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Then you complain about people not meeting you half way. Wow
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <[email protected]> 
>>> >>>>>>>>>> wrote:
>>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>>> >>>>>>>>>>> <[email protected]> wrote:
>>> >>>>>>>>>>>> What exactly do you consider reasonable?
>>> >>>>>>>>>>> Let's begin a real discussion by being HONEST about the current
>>> >>>>>>>>>>> status. Please put political correctness or your own company's 
>>> >>>>>>>>>>> wishes
>>> >>>>>>>>>>> aside, we know it's not in a good state.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Current status is the one guy who wrote the code can set a
>>> >>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset with 1024
>>> >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware).
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> My concern is everyone else except the one guy; I want it to be
>>> >>>>>>>>>>> usable. Increasing dimensions just means an even bigger multi-gigabyte
>>> >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>>> >>>>>>>>>>> It is also a permanent backwards compatibility decision, we 
>>> >>>>>>>>>>> have to
>>> >>>>>>>>>>> support it once we do this and we can't just say "oops" and 
>>> >>>>>>>>>>> flip it
>>> >>>>>>>>>>> back.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer is really 
>>> >>>>>>>>>>> to
>>> >>>>>>>>>>> avoid merges because they are so slow and it would be DAYS 
>>> >>>>>>>>>>> otherwise,
>>> >>>>>>>>>>> or if it's to avoid merges so it doesn't hit OOM.
>>> >>>>>>>>>>> Also from personal experience, it takes trial and error (means
>>> >>>>>>>>>>> experiencing OOM on merge!!!) before you get those heap values 
>>> >>>>>>>>>>> correct
>>> >>>>>>>>>>> for your dataset. This usually means starting over which is
>>> >>>>>>>>>>> frustrating and wastes more time.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in IndexWriter; it seems
>>> >>>>>>>>>>> to me like a good idea. Maybe the multi-gigabyte ram buffer can be
>>> >>>>>>>>>>> avoided in this way and performance improved by writing bigger
>>> >>>>>>>>>>> segments with lucene's defaults. But this doesn't mean we can 
>>> >>>>>>>>>>> simply
>>> >>>>>>>>>>> ignore the horrors of what happens on merge. merging needs to 
>>> >>>>>>>>>>> scale so
>>> >>>>>>>>>>> that indexing really scales.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> At least it shouldn't spike RAM on trivial data amounts and cause OOM,
>>> >>>>>>>>>>> and definitely it shouldn't burn hours and hours of CPU in O(n^2)
>>> >>>>>>>>>>> fashion when indexing.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> ---------------------------------------------------------------------
>>> >>>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>> >>>>>>>>>>> For additional commands, e-mail: [email protected]
>>> >>>>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>
>>> >>>>>>> --
>>> >>>>>>> Adrien
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>
>>> >>>> --
>>> >>>> http://www.needhamsoftware.com (work)
>>> >>>> http://www.the111shift.com (play)
>>> >>
>>> >
>>>
>>>
>>>
>>

