If we find issues with larger limits, we could make the limit configurable, like we do for maxBooleanClauses. Maybe somebody wants to run with a 100 GB heap and do one query per second.
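[Editor's note: a configurable cap along the lines suggested above might look roughly like the sketch below. The `VectorLimits` class and its method names are hypothetical, not Lucene API; the shape is modeled loosely on the existing configurable clause limit (Solr's maxBooleanClauses / IndexSearcher's max clause count).]

```java
// Hypothetical sketch of a configurable vector-dimension cap.
// Class and method names are invented for illustration; this is not Lucene API.
public final class VectorLimits {
    // Conservative default; operators with large heaps could raise it.
    private static int maxVectorDimensions = 1024;

    public static int getMaxVectorDimensions() {
        return maxVectorDimensions;
    }

    public static void setMaxVectorDimensions(int max) {
        if (max <= 0) {
            throw new IllegalArgumentException("max must be positive, got " + max);
        }
        maxVectorDimensions = max;
    }

    // As noted in the thread, the limit only validates: it raises an
    // exception and sizes no data structures.
    public static void checkDimension(int dimension) {
        if (dimension > maxVectorDimensions) {
            throw new IllegalArgumentException(
                "vector dimension " + dimension
                    + " exceeds configured limit " + maxVectorDimensions);
        }
    }

    private VectorLimits() {}
}
```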
Where I work (LexisNexis), we have high-value queries, but just not that many of them per second.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Apr 6, 2023, at 8:57 AM, Alessandro Benedetti <a.benede...@sease.io> wrote:
>
> To be clear, Robert, I agree with you on not bumping it to 2048 or some other insufficiently motivated constant.
>
> But I disagree on the performance perspective: I am absolutely in favour of working to improve the current performance, but I think that is disconnected from this limit.
>
> Not all users need billions of vectors, and maybe tomorrow a new chip is released that speeds up the processing 100x.
>
> The limit, as far as I know, is not used to initialise or optimise any data structure; it is only used to raise an exception.
>
> I don't see a big problem in allowing 10k-dimension vectors, for example, even though the majority of people won't be able to use such vectors because they are slow on an average computer. If we gain just one new user, that's better than zero. Or, if it's a reputation thing, then it's a completely different discussion, I guess.
>
> On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcm...@gmail.com> wrote:
>> Well, I'm asking people to actually try testing with such high dimensions. Based on my own experience, I consider it unusable. It seems other folks may have run into trouble too. If the project committers can't even really use vectors with such high dimension counts, then it's not in an OK state for users, and we shouldn't bump the limit.
>>
>> I'm happy to discuss/compromise etc., but simply bumping the limit without addressing the underlying usability/scalability is a real no-go; it is not really solving anything, nor is it giving users any freedom or allowing them to do something they couldn't do before. Because if it still doesn't work, it still doesn't work.
>>
>> We all need to be on the same page, grounded in reality, not fantasy: if we set a limit of 1024 or 2048, you should actually be able to index vectors with that many dimensions, and it should actually work and scale.
>>
>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
>> >
>> > As I said earlier, a max limit limits usability. It's not forcing users with small vectors to pay the performance penalty of big vectors; it's literally preventing some users from using Lucene/Solr/Elasticsearch at all. As far as I know, the max limit is used to raise an exception; it's not used to initialise or optimise data structures (please correct me if I'm wrong).
>> >
>> > Improving the algorithm's performance is a separate discussion. I don't see how a usability parameter correlates with the fact that indexing billions of vectors, of whatever dimension, is slow.
>> >
>> > What about potential users who need a few high-dimensional vectors?
>> >
>> > As I said before, I am a big +1 for NOT just raising it blindly, but I believe we need to remove the limit or size it in a way that is not a problem for either users or internal data structure optimizations, if any.
>> >
>> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
>> >>
>> >> I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024 dimensions, which is allowed today.
>> >>
>> >> IMO, based on how painful it is, the limit is already too high. I realize that will sound controversial, but please at least try it out!
>> >>
>> >> Voting +1 without at least doing this is really the "weak/unscientifically minded" approach.
>> >>
>> >> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>> >> >
>> >> > Thanks for your feedback!
>> >> >
>> >> > I agree that it should not crash.
>> >> >
>> >> > So far we did not experience crashes ourselves, but we did not index millions of vectors.
>> >> >
>> >> > I will try to reproduce the crash; maybe this will help us to move forward.
>> >> >
>> >> > Thanks
>> >> >
>> >> > Michael
>> >> >
>> >> > On 05.04.23 at 18:30, Dawid Weiss wrote:
>> >> > >> Can you describe your crash in more detail?
>> >> > > I can't. That experiment was a while ago and a quick test to see if I could index rather large-ish USPTO (patent office) data as vectors. Couldn't do it then.
>> >> > >
>> >> > >> How much RAM?
>> >> > > My indexing jobs run with rather smallish heaps to give space for I/O buffers. Think 4-8 GB at most. So yes, it could have been the problem. I recall segment merging grew slower and slower and then simply crashed. Lucene should work with low heap requirements, even if it slows down. Throwing RAM at the indexing/segment-merging problem is... I don't know, not elegant?
>> >> > >
>> >> > > Anyway. My main point was to remind folks about how Apache works: code is merged in when there are no vetoes. If Rob (or anybody else) remains unconvinced, he or she can block the change. (I didn't invent those rules.)
>> >> > >
>> >> > > D.
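[Editor's note: as rough context for the suggestion upthread to try indexing a few million 756- or 1024-dimension vectors before voting, the back-of-the-envelope arithmetic below is mine, not a figure from the thread. Raw float32 storage alone for five million 1024-dimension vectors is about 19 GiB, before any HNSW graph links, doc ids, or codec overhead.]

```java
// Back-of-the-envelope sizing for raw float32 vector storage.
// Illustrative only: ignores HNSW graph links, doc ids, and codec overhead.
public final class VectorSizing {

    /** Raw bytes needed for numVectors float32 vectors of the given dimensionality. */
    public static long rawBytes(long numVectors, int dims) {
        return numVectors * dims * Float.BYTES; // 4 bytes per float32 component
    }

    /** The same figure expressed in gibibytes. */
    public static double rawGib(long numVectors, int dims) {
        return rawBytes(numVectors, dims) / (1024.0 * 1024.0 * 1024.0);
    }

    private VectorSizing() {}
}
```

Against the 4-8 GB indexing heaps Dawid describes, data volumes of this order give a feel for why such experiments are painful, whatever the exact on-disk/in-heap split ends up being.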
>> >> > > ---------------------------------------------------------------------
>> >> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> > > For additional commands, e-mail: dev-h...@lucene.apache.org