I am not sure I get the point of making the limit configurable:
1) if it is configurable but defaults to a max of 1024, it means that behind the scenes we don't enforce any limit aside from the max integer. So if you want to set a vector dimension of 5000 for a field, you first need to set a compatible MAX and then set the dimension to 5000 for the field.
2) if we remove the limit (just an example), the user can directly set the dimension to 5000 for a field.
It seems to me that setting the max limit as a configurable constant brings all the same (negative?) considerations as removing the limit altogether, plus additional operations needed by the users to achieve the same results.
I beg your pardon if I am missing something.
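To make the two-step flow in point 1) concrete, here is a minimal sketch of what a configurable limit in the style of maxBooleanClauses (which Walter mentions below) could look like. This is purely hypothetical: VectorLimits, setMaxDimensions and checkDimension are illustrative names, not an existing Lucene API.

// Hypothetical sketch: none of these names exist in Lucene today.
// It mirrors the spirit of IndexSearcher.setMaxClauseCount(), the
// mechanism behind the maxBooleanClauses setting mentioned below.
public final class VectorLimits {

  // Default mirrors the current hard-coded ceiling.
  private static volatile int maxDimensions = 1024;

  public static void setMaxDimensions(int max) {
    if (max <= 0) {
      throw new IllegalArgumentException("max must be positive, got " + max);
    }
    maxDimensions = max;
  }

  // The only use of the limit: a guard clause that throws. It does not
  // size or pre-allocate any data structure.
  public static void checkDimension(int dimension) {
    if (dimension > maxDimensions) {
      throw new IllegalArgumentException(
          "vector dimension " + dimension + " > configured max " + maxDimensions);
    }
  }

  private VectorLimits() {}
}

Under this sketch, declaring a 5000-dimensional field means calling VectorLimits.setMaxDimensions(8192) first and then setting the field's dimension: exactly the extra step questioned above.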
On Thu, 6 Apr 2023, 17:02 Walter Underwood, <wun...@wunderwood.org> wrote:

> If we find issues with larger limits, maybe have a configurable limit like
> we do for maxBooleanClauses. Maybe somebody wants to run with a 100G heap
> and do one query per second.
>
> Where I work (LexisNexis), we have high-value queries, but just not that
> many of them per second.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Apr 6, 2023, at 8:57 AM, Alessandro Benedetti <a.benede...@sease.io> wrote:
>
> To be clear, Robert, I agree with you in not bumping it just to 2048 or
> whatever insufficiently motivated constant.
>
> But I disagree on the performance perspective:
> I am absolutely positive about working to improve the current performance,
> but I think this is disconnected from that limit.
>
> Not all users need billions of vectors; maybe tomorrow a new chip is
> released that speeds up the processing 100x, or whatever...
>
> The limit, as far as I know, is not used to initialise or optimise any data
> structure; it's only used to raise an exception.
>
> I don't see a big problem in allowing 10k-dimensional vectors, for example,
> even if the majority of people won't be able to use such vectors because
> they are slow on the average computer.
> If we just get one new user, it's better than zero.
> Or, if it's a reputation thing, then it's a completely different discussion,
> I guess.
>
> On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcm...@gmail.com> wrote:
>
>> Well, I'm asking people to actually try testing with such high dimensions.
>> Based on my own experience, I consider it unusable. It seems other
>> folks may have run into trouble too. If the project committers can't
>> even really use vectors with such high dimension counts, then it's not
>> in an OK state for users, and we shouldn't bump the limit.
>>
>> I'm happy to discuss/compromise etc., but simply bumping the limit
>> without addressing the underlying usability/scalability is a real
>> no-go; it is not really solving anything, nor is it giving users any
>> freedom or allowing them to do something they couldn't do before.
>> Because if it still doesn't work, it still doesn't work.
>>
>> We all need to be on the same page, grounded in reality, not fantasy:
>> if we set a limit of 1024 or 2048, you should actually be able to index
>> vectors with that many dimensions, and it should actually work and scale.
>>
>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
>> <a.benede...@sease.io> wrote:
>> >
>> > As I said earlier, a max limit limits usability.
>> > It's not forcing users with small vectors to pay the performance
>> > penalty of big vectors; it's literally preventing some users from using
>> > Lucene/Solr/Elasticsearch at all.
>> > As far as I know, the max limit is used to raise an exception; it's not
>> > used to initialise or optimise data structures (please correct me if
>> > I'm wrong).
>> >
>> > Improving the algorithm's performance is a separate discussion.
>> > The fact that indexing billions of vectors of whatever dimension is
>> > slow has no bearing on a usability parameter.
>> >
>> > What about potential users that need just a few high-dimensional vectors?
>> >
>> > As I said before, I am a big +1 for NOT raising it blindly, but I
>> > believe we need to remove the limit, or size it in a way that is not a
>> > problem for either users or internal data structure optimisations, if any.
>> >
>> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
>> >>
>> >> I'd ask anyone voting +1 to raise this limit to at least try to index
>> >> a few million vectors with 756 or 1024 dimensions, which is allowed today.
>> >>
>> >> IMO, based on how painful it is, it seems the limit is already too
>> >> high. I realize that will sound controversial, but please at least try
>> >> it out!
>> >>
>> >> Voting +1 without at least doing this is really the
>> >> "weak/unscientifically minded" approach.
>> >>
>> >> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
>> >> <michael.wech...@wyona.com> wrote:
>> >> >
>> >> > Thanks for your feedback!
>> >> >
>> >> > I agree that it should not crash.
>> >> >
>> >> > So far we have not experienced crashes ourselves, but we have not
>> >> > indexed millions of vectors.
>> >> >
>> >> > I will try to reproduce the crash; maybe this will help us to move
>> >> > forward.
>> >> >
>> >> > Thanks
>> >> >
>> >> > Michael
>> >> >
>> >> > On 05.04.23 at 18:30, Dawid Weiss wrote:
>> >> > >> Can you describe your crash in more detail?
>> >> > > I can't. That experiment was a while ago and a quick test to see if I
>> >> > > could index rather large-ish USPTO (patent office) data as vectors.
>> >> > > Couldn't do it then.
>> >> > >
>> >> > >> How much RAM?
>> >> > > My indexing jobs run with rather smallish heaps to give space for I/O
>> >> > > buffers. Think 4-8GB at most. So yes, it could have been the problem.
>> >> > > I recall segment merging grew slower and slower and then simply
>> >> > > crashed. Lucene should work with low heap requirements, even if it
>> >> > > slows down. Throwing RAM at the indexing/segment-merging problem
>> >> > > is... I don't know - not elegant?
>> >> > >
>> >> > > Anyway, my main point was to remind folks about how Apache works:
>> >> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>> >> > > remains unconvinced, he or she can block the change. (I didn't invent
>> >> > > those rules.)
>> >> > >
>> >> > > D.
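For anyone who wants to run the experiment Robert proposes, a minimal sketch follows, assuming a recent Lucene 9.x where KnnFloatVectorField is available (earlier 9.x releases use KnnVectorField instead). The dimension, document count, index path and similarity function are arbitrary test choices, and the random vectors are only a stand-in for real embeddings.

import java.nio.file.Paths;
import java.util.Random;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class HighDimIndexingTest {
  public static void main(String[] args) throws Exception {
    final int dim = 1024;          // the current ceiling under discussion
    final int numDocs = 2_000_000; // "a few million", as suggested above
    Random random = new Random(42);

    try (FSDirectory dir = FSDirectory.open(Paths.get("knn-test-index"));
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      long start = System.nanoTime();
      for (int i = 0; i < numDocs; i++) {
        // Random non-normalised vectors; EUCLIDEAN avoids the
        // unit-length expectation of DOT_PRODUCT.
        float[] vector = new float[dim];
        for (int j = 0; j < dim; j++) {
          vector[j] = random.nextFloat();
        }
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("vec", vector, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
      }
      // Segment merging is where the thread reports indexing fell over,
      // so force a full merge to exercise that path as well.
      writer.forceMerge(1);
      long seconds = (System.nanoTime() - start) / 1_000_000_000L;
      System.out.println("Indexed " + numDocs + " vectors of dim " + dim
          + " in " + seconds + "s");
    }
  }
}

Per Dawid's observation about heap sizes, the interesting configuration is a deliberately small heap (e.g. -Xmx4g to -Xmx8g): the reported pain shows up as merges slowing down more and more as segments accumulate.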