I am not sure I see the point of making the limit configurable:

1) if it is configurable but defaults to a max of 1024, it means that behind
the scenes we don't enforce any limit apart from the max integer.
So if you want to set a field's vector dimension to 5000, you first need to
set a compatible MAX and then set the dimension to 5000 for the field.

2) if we remove the limit entirely (just as an example), the user can
directly set the dimension to 5000 for a field.

It seems to me that setting the max limit as a configurable constant brings
all the same (negative?) considerations as removing the limit altogether,
plus extra operations the user must perform to achieve the same result (see
the sketch below).
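
For illustration, here is a minimal sketch of what such a configurable cap
and the resulting two-step workflow might look like (purely hypothetical:
neither this class nor the system property exists in Lucene):

    // Hypothetical sketch only; no such class or property exists in Lucene.
    public final class VectorLimits {

      // Default mirrors today's hard-coded maximum.
      private static final int DEFAULT_MAX_DIMENSIONS = 1024;

      // Step 1 of the workflow: the user raises the cap first, e.g. with
      // -Dlucene.vector.maxDimensions=5000
      public static int maxDimensions() {
        return Integer.getInteger("lucene.vector.maxDimensions",
            DEFAULT_MAX_DIMENSIONS);
      }

      // Step 2: only now does setting a field's dimension to 5000 pass.
      public static void checkDimension(int dimension) {
        if (dimension <= 0 || dimension > maxDimensions()) {
          throw new IllegalArgumentException(
              "vector dimension must be in (0, " + maxDimensions()
                  + "] but was: " + dimension);
        }
      }
    }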

I beg your pardon if I am missing something.

On Thu, 6 Apr 2023, 17:02 Walter Underwood, <wun...@wunderwood.org> wrote:

> If we find issues with larger limits, maybe have a configurable limit like
> we do for maxBooleanClauses. Maybe somebody wants to run with a 100G heap
> and do one query per second.
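>
> For reference, this is roughly how the existing clause-count cap is raised
> today (org.apache.lucene.search.IndexSearcher); a vector-dimension knob
> could plausibly follow the same pattern, though that analog is hypothetical:
>
>     // Real Lucene API: the boolean-clause cap is a JVM-wide setting,
>     // default 1024, that callers can raise explicitly.
>     IndexSearcher.setMaxClauseCount(4096);
>     int cap = IndexSearcher.getMaxClauseCount();
>     // A hypothetical setMaxVectorDimensions(...) could work the same way.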
>
> Where I work (LexisNexis), we have high-value queries, but just not that
> many of them per second.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Apr 6, 2023, at 8:57 AM, Alessandro Benedetti <a.benede...@sease.io>
> wrote:
>
> To be clear, Robert, I agree with you about not bumping it to 2048 or some
> other insufficiently motivated constant.
>
> But I disagree on the performance perspective:
> I am absolutely in favour of working to improve the current performance,
> but I think that is disconnected from this limit.
>
> Not all users need billions of vectors, and maybe tomorrow a new chip will
> be released that speeds up the processing 100x, or whatever...
>
> As far as I know, the limit is not used to initialise or optimise any data
> structure; it is only used to raise an exception.
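>
> A minimal sketch of the kind of check I mean (illustrative only, not
> Lucene's actual source):
>
>     // Illustrative only - not Lucene's actual code.
>     static final int MAX_DIMENSIONS = 1024; // the current cap
>     int vectorDimension;
>
>     void setVectorDimension(int dimension) {
>       if (dimension <= 0 || dimension > MAX_DIMENSIONS) {
>         throw new IllegalArgumentException(
>             "vector dimension must be in (0, " + MAX_DIMENSIONS
>                 + "]; got " + dimension);
>       }
>       // Nothing is allocated or sized from the cap itself.
>       this.vectorDimension = dimension;
>     }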
>
> I don't see a big problem in allowing, for example, 10k-dimensional
> vectors, even if the majority of people won't be able to use them because
> they are slow on an average computer.
> If we gain just one new user, that's better than zero.
> Or, well, if it's a reputation thing, then it's a completely different
> discussion, I guess.
>
>
> On Thu, 6 Apr 2023, 16:47 Robert Muir, <rcm...@gmail.com> wrote:
>
>> Well, I'm asking people to actually try testing with such high
>> dimensions. Based on my own experience, I consider it unusable. It seems
>> other folks may have run into trouble too. If the project committers
>> can't even really use vectors with such high dimension counts, then it's
>> not in an OK state for users, and we shouldn't bump the limit.
>>
>> I'm happy to discuss/compromise etc., but simply bumping the limit
>> without addressing the underlying usability/scalability is a real no-go.
>> It is not really solving anything, nor is it giving users any freedom or
>> allowing them to do something they couldn't do before. Because if it
>> still doesn't work, it still doesn't work.
>>
>> We all need to be on the same page, grounded in reality, not fantasy:
>> if we set a limit of 1024 or 2048, you should actually be able to index
>> vectors with that many dimensions, and it should actually work and scale.
>>
>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
>> <a.benede...@sease.io> wrote:
>> >
>> > As I said earlier, a max limit limits usability.
>> > It's not forcing users with small vectors to pay the performance
>> > penalty of big vectors; it's literally preventing some users from using
>> > Lucene/Solr/Elasticsearch at all.
>> > As far as I know, the max limit is used to raise an exception; it's not
>> > used to initialise or optimise data structures (please correct me if I'm
>> > wrong).
>> >
>> > Improving the algorithm's performance is a separate discussion.
>> > I don't see how the fact that indexing billions of vectors, of whatever
>> > dimension, is slow correlates with a usability parameter.
>> >
>> > What about potential users who need just a few high-dimensional vectors?
>> >
>> > As I said before, I am a big +1 for NOT just raising it blindly, but I
>> > believe we need to either remove the limit or size it in a way that is
>> > not a problem for users or for internal data structure optimizations,
>> > if any.
>> >
>> >
>> > On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
>> >>
>> >> I'd ask anyone voting +1 to raise this limit to at least try to index
>> >> a few million vectors with 756 or 1024 dimensions, which is allowed
>> >> today.
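>> >>
>> >> A rough harness for that experiment (a sketch, assuming a recent
>> >> Lucene 9.x where KnnFloatVectorField exists; the class name, field
>> >> name, and document count are illustrative):
>> >>
>> >>     import java.nio.file.Paths;
>> >>     import java.util.Random;
>> >>     import org.apache.lucene.document.Document;
>> >>     import org.apache.lucene.document.KnnFloatVectorField;
>> >>     import org.apache.lucene.index.IndexWriter;
>> >>     import org.apache.lucene.index.IndexWriterConfig;
>> >>     import org.apache.lucene.index.VectorSimilarityFunction;
>> >>     import org.apache.lucene.store.FSDirectory;
>> >>
>> >>     public class IndexVectors {
>> >>       public static void main(String[] args) throws Exception {
>> >>         int dim = 1024;          // the currently allowed maximum
>> >>         int numDocs = 2_000_000; // "a few million"
>> >>         Random random = new Random(42);
>> >>         try (FSDirectory dir = FSDirectory.open(Paths.get("vec-index"));
>> >>             IndexWriter writer =
>> >>                 new IndexWriter(dir, new IndexWriterConfig())) {
>> >>           for (int i = 0; i < numDocs; i++) {
>> >>             float[] vector = new float[dim];
>> >>             for (int j = 0; j < dim; j++) {
>> >>               vector[j] = random.nextFloat();
>> >>             }
>> >>             Document doc = new Document();
>> >>             doc.add(new KnnFloatVectorField(
>> >>                 "vec", vector, VectorSimilarityFunction.EUCLIDEAN));
>> >>             writer.addDocument(doc);
>> >>           }
>> >>           writer.forceMerge(1); // merging is where the pain shows up
>> >>         }
>> >>       }
>> >>     }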
>> >>
>> >> IMO, based on how painful it is, the limit already seems too high. I
>> >> realize that will sound controversial, but please at least try it out!
>> >>
>> >> Voting +1 without at least doing this is really the
>> >> "weak/unscientifically minded" approach.
>> >>
>> >> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
>> >> <michael.wech...@wyona.com> wrote:
>> >> >
>> >> > Thanks for your feedback!
>> >> >
>> >> > I agree that it should not crash.
>> >> >
>> >> > So far we have not experienced crashes ourselves, but we have not
>> >> > indexed millions of vectors.
>> >> >
>> >> > I will try to reproduce the crash; maybe this will help us move
>> >> > forward.
>> >> >
>> >> > Thanks
>> >> >
>> >> > Michael
>> >> >
>> >> > On 05.04.23 at 18:30, Dawid Weiss wrote:
>> >> > >> Can you describe your crash in more detail?
>> >> > > I can't. That experiment was a while ago: a quick test to see if I
>> >> > > could index rather large-ish USPTO (patent office) data as vectors.
>> >> > > I couldn't do it then.
>> >> > >
>> >> > >> How much RAM?
>> >> > > My indexing jobs run with rather smallish heaps to give space for
>> >> > > I/O buffers. Think 4-8GB at most. So yes, it could have been the
>> >> > > problem. I recall segment merging grew slower and slower and then
>> >> > > simply crashed. Lucene should work with low heap requirements, even
>> >> > > if it slows down. Throwing RAM at the indexing/segment-merging
>> >> > > problem is... I don't know - not elegant?
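>> >> > >
>> >> > > For instance, a run like the following would mirror that setup (the
>> >> > > heap flags are the point; the classpath and main class are just
>> >> > > illustrative placeholders):
>> >> > >
>> >> > >     java -Xms4g -Xmx8g -cp lucene-core.jar:. MyIndexingJob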
>> >> > >
>> >> > > Anyway. My main point was to remind folks about how Apache works -
>> >> > > code is merged in when there are no vetoes. If Rob (or anybody else)
>> >> > > remains unconvinced, he or she can block the change. (I didn't
>> >> > > invent those rules.)
>> >> > >
>> >> > > D.
