To proceed pragmatically, I opened a new thread: *Dimensions Limit for KNN vectors - Next Steps*. It is meant to address the main point of this discussion.
For the following points:

2) [medium task] We all want more benchmarks for Lucene vector-based search, with a good variety of vector dimensions and encodings
3) [big task?] Some people would like to improve vector-based search performance, because it is currently not acceptable; it's not clear when and how

Feel free to create the discussion threads for these if you believe they are an immediate priority.

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*
e-mail: a.benede...@sease.io
*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>

On Fri, 21 Apr 2023 at 08:30, Michael Wechner <michael.wech...@wyona.com> wrote:

> Yes, they are, but they should still help us test performance and scalability :-)
>
> On 21.04.23 at 09:24, Ishan Chattopadhyaya wrote:
>
> Seems like they were all 768 dimensions.
>
> On Fri, 21 Apr, 2023, 11:48 am Michael Wechner, <michael.wech...@wyona.com> wrote:
>
>> Hi Together
>>
>> Cohere just published approx. 100 million embeddings based on Wikipedia content:
>>
>> https://txt.cohere.com/embedding-archives-wikipedia/
>>
>> see
>>
>> https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
>> https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings
>> ...
>>
>> HTH
>>
>> Michael
>>
>> On 13.04.23 at 07:58, Michael Wechner wrote:
>>
>> Hi Kent
>>
>> Great, thank you very much!
>>
>> Will download it later today :-)
>>
>> All the best
>>
>> Michael
>>
>> On 13.04.23 at 01:35, Kent Fitch wrote:
>>
>> Hi Michael (and anyone else who wants just over 240K "real world" ada-002 vectors of dimension 1536),
>>
>> You are welcome to retrieve a tar.gz file which contains:
>> - 47K embeddings of Canberra Times news article text from 1994
>> - 38K embeddings of the first paragraphs of Wikipedia articles about organisations
>> - 156.6K embeddings of the first paragraphs of Wikipedia articles about people
>>
>> https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing
>>
>> The file is about 1.7GB and will expand to about 4.4GB. It will be accessible for at least a week, and I hope you don't hit any Google Drive download limits trying to retrieve it.
>>
>> The embeddings were generated using my OpenAI account, and you are welcome to use them for any purpose you like.
>>
>> best wishes,
>>
>> Kent Fitch
>>
>> On Wed, Apr 12, 2023 at 4:37 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>>
>>> Thank you very much for your feedback!
>>>
>>> In a previous post (April 7) you wrote you could make available the 47K ada-002 vectors, which would be great!
>>>
>>> Would it make sense to set up a public GitHub repo, so that others could use or also contribute vectors?
>>> Thanks
>>>
>>> Michael Wechner
>>>
>>> On 12.04.23 at 04:51, Kent Fitch wrote:
>>>
>>> I only know some characteristics of the OpenAI ada-002 vectors. They are very popular as embeddings/text characterisations, as they allow more accurate/"human meaningful" semantic search results with fewer dimensions than their predecessors. I've evaluated a few different embedding models, including some BERT variants, CLIP ViT-L-14 (with 768 dims, which was quite good), OpenAI's ada-001 (1024 dims) and babbage-001 (2048 dims), and ada-002 is qualitatively the best, although that will certainly change!
>>>
>>> In any case, ada-002 vectors have interesting characteristics that I think mean you could confidently create synthetic vectors which would be hard to distinguish from "real" vectors. I found this from looking at 47K ada-002 vectors generated across a full year (1994) of newspaper articles from the Canberra Times and 200K Wikipedia articles:
>>> - there is no discernible/significant correlation between values in any pair of dimensions
>>> - all but 5 of the 1536 dimensions have an almost identical distribution of values, shown as the central blob on these graphs (they plot only a few of the 1531 dimensions with clumped values plus the 5 "outlier" dimensions, but all 1531 non-outlier dims fall in that blob), which makes for some easy quantisation from float to byte if you don't want to go the full k-means/clustering/Lloyd's-algorithm approach:
>>> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>>> - the variance of the value of each dimension is characteristic:
>>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>>>
>>> This probably represents something significant about how the ada-002 embeddings are created, but I think it also means creating "realistic" values is possible. I did not use this information when testing recall and performance of Lucene's HNSW implementation on 192m documents. Instead, I slightly dithered the values of a "real" set of 47K docs, stored fields in each doc referencing the "base" document the dithers were made from, and used different dithering magnitudes so that I could test recall with different neighbour sizes ("M"), construction beam-widths and search beam-widths.
>>>
>>> best regards
>>>
>>> Kent Fitch
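A minimal sketch of the dithering Kent describes, assuming unit-length float[] embeddings; the class name and the uniform-noise model are illustrative assumptions, not Kent's actual code:

    import java.util.Random;

    // Perturb each component of a "real" embedding by uniform noise of a
    // chosen magnitude, then re-normalize to unit length so dot-product /
    // cosine comparisons stay meaningful. (Illustrative helper only.)
    public final class DitherVectors {
      static float[] dither(float[] base, float magnitude, Random rnd) {
        float[] out = new float[base.length];
        for (int i = 0; i < base.length; i++) {
          out[i] = base[i] + (rnd.nextFloat() * 2f - 1f) * magnitude;
        }
        return normalize(out);
      }

      static float[] normalize(float[] v) {
        double sum = 0;
        for (float x : v) sum += (double) x * x;
        float inv = (float) (1.0 / Math.sqrt(sum));
        for (int i = 0; i < v.length; i++) v[i] *= inv;
        return v;
      }
    }

Larger magnitudes produce synthetic vectors farther from the base document, which is what lets recall be probed at different neighbour sizes and beam-widths.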
>>> On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>>
>>>> I understand what you mean, namely that it seems artificial, but I don't understand why this matters for testing the performance and scalability of the indexing.
>>>>
>>>> Let's assume Lucene's limit were 4 instead of 1024, and there were only open-source models generating vectors with 4 dimensions, for example
>>>>
>>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>>
>>>> and now I concatenate them into vectors with 8 dimensions
>>>>
>>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>>>
>>>> and normalize them to length 1.
>>>>
>>>> Why should this be any different from a model which acts as a black box generating vectors with 8 dimensions?
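The construction Michael describes is straightforward; a minimal sketch (the method name is illustrative):

    // Concatenate two low-dimensional vectors, then scale the result to
    // unit length so it looks like the output of a single 8-dim model.
    static float[] concatAndNormalize(float[] a, float[] b) {
      float[] out = new float[a.length + b.length];
      System.arraycopy(a, 0, out, 0, a.length);
      System.arraycopy(b, 0, out, a.length, b.length);
      double sum = 0;
      for (float x : out) sum += (double) x * x;
      float inv = (float) (1.0 / Math.sqrt(sum));
      for (int i = 0; i < out.length; i++) out[i] *= inv;
      return out;
    }

Applied to the first two 4-dim vectors above, this yields the first 8-dim vector (up to the final unit-length scaling).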
>>>> On 11.04.23 at 19:05, Michael Sokolov wrote:
>>>>
>>>> >> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>>>> >
>>>> > We have plenty of text; the problem is coming up with a realistic vector model that requires as many dimensions as people seem to be demanding. As I said above, after surveying huggingface I couldn't find any text-based model using more than 768 dimensions. So far we have some ideas of generating higher-dimensional data by dithering or concatenating existing data, but it seems artificial.
>>>> >
>>>> > On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>>> >
>>>> >> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>>>> >>
>>>> >> I am asking because I recently attended the following presentation by Anastassia Shaitarova (UZH Institute for Computational Linguistics, https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html):
>>>> >>
>>>> >> ----
>>>> >> Can we Identify Machine-Generated Text? An Overview of Current Approaches
>>>> >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>>>> >>
>>>> >> The detection of machine-generated text has become increasingly important due to the prevalence of automated content generation and its potential for misuse. In this talk, we will discuss the motivation for automatic detection of generated text. We will present the currently available methods, including feature-based classification as a "first line of defense." We will provide an overview of the detection tools that have been made available so far and discuss their limitations. Finally, we will reflect on some open problems associated with the automatic discrimination of generated texts.
>>>> >> ----
>>>> >>
>>>> >> Her conclusion was that it has become basically impossible to differentiate between text generated by humans and text generated by, for example, ChatGPT.
>>>> >>
>>>> >> Others have a slightly different opinion, though; see for example
>>>> >>
>>>> >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>>> >>
>>>> >> But I would argue that real-world and synthetic data have become close enough that testing the performance and scalability of indexing should be possible with synthetic data.
>>>> >>
>>>> >> I completely agree that we have to base our discussions and decisions on scientific methods, and that we have to make sure that Lucene performs and scales well and that we understand the limits and what is going on under the hood.
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> Michael W
>>>> >>
>>>> >> On 11.04.23 at 14:29, Michael McCandless wrote:
>>>> >>
>>>> >> +1 to test on real vector data -- if you test on synthetic data you draw synthetic conclusions.
>>>> >>
>>>> >> Can someone post the theoretical performance (CPU and RAM required) of HNSW construction? Do we know/believe our HNSW implementation has achieved that theoretical big-O performance? Maybe we have some silly performance bug that's causing it not to?
>>>> >>
>>>> >> As I understand it, HNSW makes the tradeoff of costly construction for faster searching, which is typically the right tradeoff for search use cases. We do this in other parts of the Lucene index too.
>>>> >>
>>>> >> Lucene will do a logarithmic number of merges over time, i.e. each doc will be merged O(log(N)) times in its lifetime in the index. We need to multiply that by the cost of re-building the whole HNSW graph on each merge. BTW, other things in Lucene, like BKD/dimensional points, also rebuild the whole data structure on each merge, I think? But, as Rob pointed out, stored-fields merging does indeed use some sneaky tricks to avoid excessive block decompress/recompress on each merge.
>>>> >>
>>>> >>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>>> >>
>>>> >> Actually I think Robert's veto stands on its technical merit already. Robert's take on technical matters very much resonates with me, even if he is sometimes prickly in how he expresses it ;)
>>>> >>
>>>> >> His point is that we, as a dev community, are not paying enough attention to the indexing performance of our KNN algo (HNSW) and implementation, and that it is reckless to increase / remove limits in that state. It is indeed a one-way-door decision, and one must confront such decisions with caution, especially for such widely used base infrastructure as Lucene. We don't even advertise today in our javadocs that you need XXX heap if you index vectors with dimension Y, fanout X, levels Z, etc.
>>>> >>
>>>> >> RAM used during merging is unaffected by dimensionality, but is affected by fanout, because the HNSW graph (not the raw vectors) is memory resident, I think? Maybe we could move it off-heap and let the OS manage the memory (and still document the RAM requirements)? Maybe merge RAM costs should be accounted for in IW's RAM buffer accounting? They are not today, and there are some other things that use non-trivial RAM, e.g. the doc mapping (to compress the docid space when deletions are reclaimed).
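A back-of-the-envelope sketch of that last point, in code. The level-0 fanout of 2*M follows the HNSW paper's convention, the ~5% upper-level share is an assumed geometric-decay overhead rather than a measured Lucene figure, and neighbour ids are assumed to be stored as ints:

    public final class HnswRamEstimate {
      // Heap held by the HNSW graph itself (neighbour lists, not raw
      // vectors): level 0 allows up to 2*M neighbours per node; upper
      // levels add an assumed ~5% on top.
      static long approxGraphHeapBytes(long numVectors, int m) {
        long level0 = numVectors * 2L * m * Integer.BYTES;
        long upper  = (long) (numVectors * 0.05 * m * Integer.BYTES);
        return level0 + upper;
      }

      public static void main(String[] args) {
        // ~1.3 GB for 10M vectors at M=16, independent of vector dimension.
        System.out.println(approxGraphHeapBytes(10_000_000L, 16));
      }
    }

Under these assumptions the graph's heap grows with fanout and document count but not with dimension, matching the description above.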
>>>> >> When we added KNN vector testing to Lucene's nightly benchmarks, the indexing time massively increased -- see annotations DH and DP here: https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of course, that is using a single thread for indexing (on a box that has 128 cores!) so that we produce a deterministic index every night ...
>>>> >>
>>>> >> Stepping out (meta) a bit ... this discussion is precisely one of the awesome benefits of the (informed) veto. It means risky changes to the software, as determined by any single informed developer on the project, can force a healthy discussion about the problem at hand. Robert is legitimately concerned about a real issue, and so we should use our creative energies to characterize our HNSW implementation's performance, document it clearly for users, and uncover ways to improve it.
>>>> >>
>>>> >> Mike McCandless
>>>> >>
>>>> >> http://blog.mikemccandless.com
>>>> >>
>>>> >> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>>> >>
>>>> >>> I think Gus's points are on target.
>>>> >>>
>>>> >>> I recommend we move this forward in this way: we stop any discussion, everyone interested proposes an option with a motivation, and then we aggregate the options and perhaps create a vote.
>>>> >>>
>>>> >>> I am also on the same page on the fact that a veto should come with clear and reasonable technical merit, which in my opinion has not yet been given.
>>>> >>>
>>>> >>> I also apologise if any of my words sounded harsh or like personal attacks; I never meant them that way.
>>>> >>>
>>>> >>> My proposed option:
>>>> >>>
>>>> >>> 1) remove the limit and potentially make it configurable.
>>>> >>> Motivation: the system administrator can enforce a limit that their users need to respect, in line with whatever the admin decides is acceptable. The default can stay the current one.
>>>> >>>
>>>> >>> That's my favourite at the moment, but I agree that this may need to change in the future, as we may optimise the data structures for certain dimensions. I am a big fan of YAGNI (you aren't going to need it), so I am OK with facing a different discussion if that happens.
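A purely illustrative sketch of what option 1) could look like; this is not an existing Lucene API, and the class and property name are hypothetical:

    // Hypothetical JVM-level override for the max vector dimension,
    // defaulting to the current hard-coded limit. An admin could raise it
    // with -Dlucene.hnsw.maxDimensions=2048 while the shipped default
    // stays unchanged.
    public final class VectorLimits {
      public static final int DEFAULT_MAX_DIMENSIONS = 1024;
      public static final int MAX_DIMENSIONS =
          Integer.getInteger("lucene.hnsw.maxDimensions", DEFAULT_MAX_DIMENSIONS);

      public static void checkDimension(int dim) {
        if (dim > MAX_DIMENSIONS) {
          throw new IllegalArgumentException(
              "vector dimension " + dim + " exceeds configured maximum " + MAX_DIMENSIONS);
        }
      }
    }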
>>>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com> wrote:
>>>> >>>
>>>> >>>> What I see so far:
>>>> >>>>
>>>> >>>> - Much positive support for raising the limit
>>>> >>>> - Slightly less support for removing it or making it configurable
>>>> >>>> - A single veto which argues that an (as yet undefined) performance standard must be met before raising the limit
>>>> >>>> - Hot tempers (various) making this discussion difficult
>>>> >>>>
>>>> >>>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>>> >>>>
>>>> >>>> - No standard for the performance is given, so it cannot be technically met. Without hard criteria it's a moving target.
>>>> >>>> - It appears to encode a valuation of the user's time, and that valuation is really up to the user. Some users may consider 2 hours useless and not worth it, and others might happily wait 2 hours. This is not a technical decision; it's a business decision regarding the relative value of the time invested vs the value of the result. If I can cure cancer by indexing for a year, that might be worth it... (hyperbole of course).
>>>> >>>>
>>>> >>>> Things I would consider to have technical merit that I don't hear:
>>>> >>>>
>>>> >>>> - Impact on the speed of *other* indexing operations (devaluation of other functionality)
>>>> >>>> - Actual scenarios that work when the limit is low and fail when the limit is high (new failure on the same data with the limit raised)
>>>> >>>>
>>>> >>>> One thing that might or might not have technical merit:
>>>> >>>>
>>>> >>>> - If someone feels there is a lack of documentation of the costs/performance implications of using large vectors, possibly including reproducible benchmarks establishing the scaling behavior (there seems to be disagreement on O(n) vs O(n^2))
>>>> >>>>
>>>> >>>> The users *should* know what they are getting into, but if the cost is worth it to them, they should be able to pay it without forking the project. If this veto causes a fork, that's not good.
>>>> >>>>
>>>> >>>> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <msoko...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100- and 300-dimensional varieties, and we can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that, and I believe the performance numbers would be plausible.
>>>> >>>>>
>>>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>>> Can we set up a branch in which the limit is bumped to 2048, and then have a realistic, free data set (a Wikipedia sample or something) that has, say, 5 million docs and vectors created using public data (GloVe pre-trained embeddings or the like)? We could then run indexing on the same hardware with 512, 1024 and 2048 dimensions and see what the numbers, limits and behavior actually are.
>>>> >>>>>>
>>>> >>>>>> I can help in writing this, but not until after Easter.
>>>> >>>>>>
>>>> >>>>>> Dawid
>>>> >>>>>>
>>>> >>>>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>>> As Dawid pointed out earlier on this thread, this is the rule for Apache projects: a single -1 vote on a code change is a veto and cannot be overridden. Furthermore, Robert is one of the people on this project who has worked the most on debugging subtle bugs, making Lucene more robust and improving our test framework, so I'm listening when he voices quality concerns.
>>>> >>>>>>>
>>>> >>>>>>> The argument against removing/raising the limit that resonates with me the most is that it is a one-way door. As MikeS highlighted earlier on this thread, implementations may want to take advantage of the fact that there is a limit at some point too. This is why I don't want to remove the limit and would prefer a slight increase, such as 2048 as suggested in the original issue, which would enable most of the things that users who have been asking about raising the limit would like to do.
>>>> >>>>>>>
>>>> >>>>>>> I agree that the merge-time memory usage and slow indexing rate are not great.
>>>> >>>>>>> But it's still possible to index multi-million-vector datasets with a 4GB heap without hitting OOMEs regardless of the number of dimensions, and the feedback I'm seeing is that many users are still interested in indexing multi-million-vector datasets despite the slow indexing rate. I wish we could do better, and vector indexing is certainly more expert than text indexing, but it is still usable in my opinion. I understand how giving Lucene more information about vectors prior to indexing (e.g. clustering information, as Jim pointed out) could help make merging faster and more memory-efficient, but I would really like to avoid making it a requirement for indexing vectors, as that also makes this feature much harder to use.
>>>> >>>>>>>
>>>> >>>>>>> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>>> >>>>>>>
>>>> >>>>>>>> I am very attentive to others' opinions, but I am unconvinced here, and I am not sure that a single person's opinion should be allowed to be detrimental to such an important project.
>>>> >>>>>>>>
>>>> >>>>>>>> The limit, as far as I know, is literally just raising an exception. Removing it won't alter in any way the current performance for users in low-dimensional space. Removing it will just enable more users to use Lucene.
>>>> >>>>>>>>
>>>> >>>>>>>> If new users are unhappy with the performance in certain situations, they may contribute improvements. This is how you make progress.
>>>> >>>>>>>>
>>>> >>>>>>>> If it's a reputation thing, trust me that not allowing users to play with high-dimensional space will damage it equally.
>>>> >>>>>>>>
>>>> >>>>>>>> To me it's really a no-brainer. Removing the limit and enabling people to use high-dimensional vectors will take minutes. Improving the HNSW implementation can take months. Pick one to begin with...
>>>> >>>>>>>>
>>>> >>>>>>>> And there's no one paying me here, no company interest whatsoever; actually, I pay people to contribute. I am just convinced it's a good idea.
>>>> >>>>>>>>
>>>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>> I disagree with your categorization. I put in plenty of work and experienced plenty of pain myself, writing tests and fighting these issues, after I saw that, two releases in a row, vector indexing fell over and hit integer overflows etc. on small datasets:
>>>> >>>>>>>>>
>>>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>>>> >>>>>>>>>
>>>> >>>>>>>>> Attacking me isn't helping the situation.
>>>> >>>>>>>>>
>>>> >>>>>>>>> PS: when I said "the one guy who wrote the code" I didn't mean it in any kind of demeaning fashion, really. I meant to describe the current state of usability with respect to indexing a few million docs with high dimensions. You can scroll up the thread and see that at least one other committer on the project experienced similar pain to mine. Then think about users who aren't committers trying to use the functionality!
>>>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>> What you said about increasing dimensions requiring a bigger RAM buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> You complain that HNSW sucks and doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> You demand that people run all kinds of tests to prove you wrong, but when they do, you don't listen, and you won't put in the work yourself or you complain that it's too hard.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Then you complain about people not meeting you halfway. Wow.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>> What exactly do you consider reasonable?
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Let's begin a real discussion by being HONEST about the current status. Please put politically-correct wishes, or your own company's wishes, aside; we know it's not in a good state.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> The current status is that the one guy who wrote the code can set a multi-gigabyte RAM buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware).
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> My concern is everyone else except the one guy; I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte RAM buffer and a bigger heap to avoid OOM on merge. It is also a permanent backwards-compatibility decision: we have to support it once we do this, and we can't just say "oops" and flip it back.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> It is unclear to me whether the multi-gigabyte RAM buffer is really to avoid merges because they are so slow and it would be DAYS otherwise, or whether it's to avoid merges so it doesn't hit OOM. Also, from personal experience, it takes trial and error (meaning experiencing OOM on merge!!!) before you get those heap values correct for your dataset. This usually means starting over, which is frustrating and wastes more time.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in IndexWriter; that seems like a good idea to me. Maybe the multi-gigabyte RAM buffer can be avoided in this way, and performance improved, by writing bigger segments with Lucene's defaults. But this doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> At the least, it shouldn't spike RAM on trivial data amounts and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
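For concreteness, a sketch of the setup being debated, assuming a recent Lucene 9.x API (KnnFloatVectorField; earlier 9.x releases call it KnnVectorField). The 4096 MB buffer stands in for the "multi-gigabyte RAM buffer" in question and is an illustration, not a recommendation:

    import java.nio.file.Paths;
    import java.util.Arrays;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class VectorIndexSketch {
      public static void main(String[] args) throws Exception {
        // Large RAM buffer so vector segments flush big and merge rarely.
        IndexWriterConfig iwc = new IndexWriterConfig().setRAMBufferSizeMB(4096);
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("vector-index")), iwc)) {
          float[] v = new float[1024];
          Arrays.fill(v, 1f / 32f); // unit length: 1024 * (1/32)^2 == 1
          Document doc = new Document();
          doc.add(new KnnFloatVectorField("vec", v, VectorSimilarityFunction.DOT_PRODUCT));
          writer.addDocument(doc);
        }
      }
    }

The tradeoff Robert describes lives in that one setRAMBufferSizeMB call: a bigger buffer defers graph rebuilds to fewer, larger merges, but raises the heap needed to survive them.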