I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20 minutes with a single thread. I have some 256d vectors, but only about 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim vectors I can use for testing? If all else fails I can test with noise, but that tends to lead to meaningless results.
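One workaround, absent a real corpus, is synthetic data with some cluster structure rather than uniform noise, so nearest-neighbor results are at least measurable. A minimal sketch (pure Python; the cluster count, spread, and function name are arbitrary choices for illustration, not anything from this thread):

```python
import random

def make_clustered_vectors(n, dim, n_clusters=64, spread=0.05, seed=42):
    """Generate n dim-dimensional float vectors drawn around random
    cluster centers, so neighbors are meaningful (unlike uniform noise)."""
    rng = random.Random(seed)
    centers = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
               for _ in range(n_clusters)]
    vectors = []
    for _ in range(n):
        center = rng.choice(centers)
        # Perturb the chosen center with small Gaussian noise.
        vectors.append([x + rng.gauss(0.0, spread) for x in center])
    return vectors

vecs = make_clustered_vectors(1000, 128)
```

Scaled up, something like this at least exercises the index with data where recall against brute-force ground truth is well-defined, even if it won't reproduce the distribution of real embeddings.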
On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>
> On 06.04.23 at 17:47, Robert Muir wrote:
> > Well, I'm asking people to actually try to test using such high dimensions.
> > Based on my own experience, I consider it unusable. It seems other
> > folks may have run into trouble too. If the project committers can't
> > even really use vectors with such high dimension counts, then it's not
> > in an OK state for users, and we shouldn't bump the limit.
> >
> > I'm happy to discuss/compromise etc, but simply bumping the limit
> > without addressing the underlying usability/scalability is a real
> > no-go,
>
> I agree that this needs to be addressed
>
> > it is not really solving anything, nor is it giving users any
> > freedom or allowing them to do something they couldn't do before.
> > Because if it still doesn't work, it still doesn't work.
>
> I disagree, because it *does work* with "smaller" document sets.
>
> Currently we have to compile Lucene ourselves to not get the exception
> when using a model with a vector dimension greater than 1024,
> which is of course possible, but not really convenient.
>
> As I wrote before, to resolve this discussion, I think we should test
> and address possible issues.
>
> I will try to stop discussing now :-) and instead try to understand
> the actual issues better. It would be great if others could join in on this!
>
> Thanks
>
> Michael
>
> > We all need to be on the same page, grounded in reality, not fantasy,
> > where if we set a limit of 1024 or 2048, you can actually index
> > vectors with that many dimensions and it actually works and scales.
> >
> > On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
> > <a.benede...@sease.io> wrote:
> >> As I said earlier, a max limit limits usability.
> >> It's not forcing users with small vectors to pay the performance penalty
> >> of big vectors, it's literally preventing some users from using
> >> Lucene/Solr/Elasticsearch at all.
> >> As far as I know, the max limit is used to raise an exception; it's not
> >> used to initialise or optimise data structures (please correct me if I'm
> >> wrong).
> >>
> >> Improving the algorithm's performance is a separate discussion.
> >> I don't see how the fact that indexing billions of vectors of whatever
> >> dimension is slow relates to a usability parameter.
> >>
> >> What about potential users who need just a few high-dimensional vectors?
> >>
> >> As I said before, I am a big +1 for NOT just raising it blindly, but I
> >> believe we need to remove the limit or size it in a way that's not a problem
> >> for either users or internal data structure optimizations, if any.
> >>
> >>
> >> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
> >>> I'd ask anyone voting +1 to raise this limit to at least try to index
> >>> a few million vectors with 756 or 1024 dimensions, which is allowed today.
> >>>
> >>> IMO, based on how painful it is, it seems the limit is already too
> >>> high. I realize that will sound controversial, but please at least try
> >>> it out!
> >>>
> >>> Voting +1 without at least doing this is really the
> >>> "weak/unscientifically minded" approach.
> >>>
> >>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
> >>> <michael.wech...@wyona.com> wrote:
> >>>> Thanks for your feedback!
> >>>>
> >>>> I agree that it should not crash.
> >>>>
> >>>> So far we have not experienced crashes ourselves, but we have not indexed
> >>>> millions of vectors.
> >>>>
> >>>> I will try to reproduce the crash; maybe this will help us to move
> >>>> forward.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Michael
> >>>>
> >>>> On 05.04.23 at 18:30, Dawid Weiss wrote:
> >>>>>> Can you describe your crash in more detail?
> >>>>> I can't. That experiment was a while ago and a quick test to see if I
> >>>>> could index rather large-ish USPTO (patent office) data as vectors.
> >>>>> Couldn't do it then.
> >>>>>
> >>>>>> How much RAM?
> >>>>> My indexing jobs run with rather smallish heaps to give space for I/O
> >>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
> >>>>> I recall segment merging grew slower and slower and then simply
> >>>>> crashed. Lucene should work with low heap requirements, even if it
> >>>>> slows down. Throwing RAM at the indexing/segment-merging problem
> >>>>> is... I don't know - not elegant?
> >>>>>
> >>>>> Anyway, my main point was to remind folks about how Apache works:
> >>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
> >>>>> remains unconvinced, he or she can block the change. (I didn't invent
> >>>>> those rules.)
> >>>>>
> >>>>> D.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
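For context on the heap sizes mentioned above, the raw float32 storage alone is easy to estimate: count × dimensions × 4 bytes, before any HNSW graph or merge overhead. A quick back-of-the-envelope (the numbers below are plain arithmetic, not measurements from this thread):

```python
def raw_vector_bytes(count, dim, bytes_per_component=4):
    """Raw float32 vector storage only -- excludes HNSW graph links,
    doc-id mappings, and any transient merge buffers."""
    return count * dim * bytes_per_component

# 8M vectors at 100 dims:  8_000_000 * 100  * 4 = 3.2 GB of raw floats
# 8M vectors at 1024 dims: 8_000_000 * 1024 * 4 = ~32.8 GB,
# far beyond the 4-8GB heaps mentioned above (though Lucene streams
# vector data off-heap during merges, so heap pressure is indirect).
gb_1024 = raw_vector_bytes(8_000_000, 1024) / 1e9
```

This rough math is one way to see why tests at 100d complete comfortably while 1024d runs into trouble: the data volume per vector grows tenfold, and graph construction cost grows with it.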