One more data point:

32M 100d (fp32) vectors indexed in 1h20m (M=16, IW buffer size=1994, heap=4GB)
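
For reference, here is a minimal sketch of how such a single-threaded run can
be wired up (this is not the actual benchmark code; it assumes Lucene 9.5+
class names such as Lucene95Codec/Lucene95HnswVectorsFormat, and the random
vectors are just a stand-in for a real or dithered data set):

import java.nio.file.Paths;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class IndexVectorsBench {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig()
        // ~2GB RAM buffer, matching the "IW buffer size=1994" (MB) above
        .setRAMBufferSizeMB(1994)
        // M=16 maps to the HNSW format's maxConn; 100 is the default beamWidth
        .setCodec(new Lucene95Codec() {
          @Override
          public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
            return new Lucene95HnswVectorsFormat(16, 100);
          }
        });
    try (FSDirectory dir = FSDirectory.open(Paths.get("vector-index"));
         IndexWriter writer = new IndexWriter(dir, iwc)) {
      int numDocs = 32_000_000;
      int dim = 100;
      for (int i = 0; i < numDocs; i++) {
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("vec", randomVector(dim),
            VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
      }
    }
  }

  static float[] randomVector(int dim) {
    float[] v = new float[dim];
    for (int i = 0; i < dim; i++) {
      v[i] = (float) Math.random();
    }
    return v;
  }
}

(M=16 is already the default maxConn in recent Lucene releases, so the codec
override above is only there to make the parameter explicit; per the numbers
in this thread, indexing time grows roughly linearly with dimension, while
heap pressure shows up mostly during segment merging.)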

On Fri, Apr 7, 2023 at 8:52 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> I also want to add that we do impose some other limits on graph
> construction to help ensure that HNSW-based vector fields remain
> manageable; M is limited to <= 512, and maximum segment size also
> helps limit merge costs.
>
> On Fri, Apr 7, 2023 at 7:45 AM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > Thanks Kent - I tried something similar to what you did, I think. I took
> > a set of 256d vectors I had and concatenated them to make bigger ones,
> > then shifted the dimensions to make more of them. Here are a few
> > single-threaded indexing test runs. I ran all tests with M=16.
> >
> >
> > 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> > buffer size=1994)
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
> >
> > Increasing the vector dimension makes things take longer (scaling
> > *linearly*) but doesn't lead to RAM issues. I think we could get to
> > OOM while merging with a small heap and a large number of vectors, or
> > by increasing M, but none of this has anything to do with vector
> > dimensions. Also, if merge RAM usage is a problem I think we could
> > address it by adding accounting to the merge process and simply not
> > merging graphs when they exceed the buffer size (as we do with
> > flushing).
> >
> > Robert, since you're the only on-the-record veto here, does this
> > change your thinking at all, or if not could you share some test
> > results that didn't go the way you expected? Maybe we can find some
> > mitigation if we focus on a specific issue.
> >
> > On Fri, Apr 7, 2023 at 5:18 AM Kent Fitch <kent.fi...@gmail.com> wrote:
> > >
> > > Hi,
> > > I have been testing Lucene with a custom vector similarity and loaded
> > > 192M vectors of dim 512 bytes. (Yes, segment merges use a lot of Java
> > > memory...)
> > >
> > > As this was a performance test, the 192M vectors were derived by
> > > dithering 47k original vectors in such a way as to allow realistic ANN
> > > evaluation of HNSW. The original 47k vectors were generated by ada-002
> > > on source newspaper article text. After dithering, I used PQ to reduce
> > > their dimensionality from 1536 floats to 512 bytes: 3 source dimensions
> > > to a 1-byte code, 512 code tables, each learnt to reduce total encoding
> > > error using Lloyd's algorithm (hence the need for the custom similarity).
> > > BTW, HNSW retrieval was accurate and fast enough for the use case I was
> > > investigating, as long as a machine with 128GB memory was available, since
> > > the graph needs to be cached in memory for reasonable query rates.
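> > >
> > > To make that encoding concrete, here is a small, purely illustrative Java
> > > sketch of the PQ encoding step (not the code used in this test; codebook
> > > training with Lloyd's/k-means is omitted, and the codebooks array is
> > > assumed to already hold 512 tables of 256 three-dimensional centroids):
> > >
> > > // Encode a 1536-float vector as 512 one-byte PQ codes.
> > > // codebooks[s][c] is centroid c (a float[3]) of code table s.
> > > static byte[] pqEncode(float[] vector, float[][][] codebooks) {
> > >   int subspaces = codebooks.length;        // 512 code tables
> > >   int subDim = vector.length / subspaces;  // 3 source dimensions per code
> > >   byte[] codes = new byte[subspaces];
> > >   for (int s = 0; s < subspaces; s++) {
> > >     int best = 0;
> > >     float bestDist = Float.MAX_VALUE;
> > >     for (int c = 0; c < codebooks[s].length; c++) {  // 256 centroids
> > >       float d = 0;
> > >       for (int k = 0; k < subDim; k++) {
> > >         float diff = vector[s * subDim + k] - codebooks[s][c][k];
> > >         d += diff * diff;
> > >       }
> > >       if (d < bestDist) { bestDist = d; best = c; }
> > >     }
> > >     codes[s] = (byte) best;
> > >   }
> > >   return codes;
> > > }
> > >
> > > Scoring two such byte vectors then has to go back through the code tables
> > > (for example via precomputed centroid-to-centroid distances), which is
> > > presumably why the stock Lucene similarities don't apply and a custom
> > > similarity was needed.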
> > >
> > > Anyway, if you want them, you are welcome to those 47k vectors of 1536
> > > floats, which can be readily dithered to generate very large and realistic
> > > test vector sets.
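> > >
> > > (The exact dithering scheme isn't spelled out here; as a purely assumed
> > > illustration, one simple way to turn a small seed set into a much larger
> > > test set is to add small Gaussian noise to each base vector:)
> > >
> > > // Assumed illustration only: derive a dithered variant of a base vector
> > > // by adding small Gaussian noise scaled by sigma.
> > > static float[] dither(float[] base, java.util.Random rnd, float sigma) {
> > >   float[] out = new float[base.length];
> > >   for (int i = 0; i < base.length; i++) {
> > >     out[i] = base[i] + (float) (rnd.nextGaussian() * sigma);
> > >   }
> > >   return out;
> > > }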
> > >
> > > Best regards,
> > >
> > > Kent Fitch
> > >
> > >
> > > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wech...@wyona.com> 
> > > wrote:
> > >>
> > >> you might want to use SentenceBERT to generate vectors
> > >>
> > >> https://sbert.net
> > >>
> > >> where, for example, the model "all-mpnet-base-v2" generates vectors
> > >> with dimension 768.
> > >>
> > >> We have SentenceBERT running as a web service, which we could open up
> > >> for these tests, but because of network latency it would be faster to
> > >> run it locally.
> > >>
> > >> HTH
> > >>
> > >> Michael
> > >>
> > >>
> > >> On 07.04.23 at 10:11, Marcus Eagan wrote:
> > >>
> > >> I've started to look on the internet, and surely someone will come
> > >> through, but the challenge, I suspect, is that these vectors are
> > >> expensive to generate, so people have not gone all in on generating such
> > >> large vectors for large datasets. They certainly have not made them easy
> > >> to find. Here is the most promising one, but it is probably too small:
> > >> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> > >>
> > >> I'm still in and out of the office at the moment, but when I return, I
> > >> can ask my employer if they will sponsor a 10 million document
> > >> collection so that you can test with that. Or maybe someone from work
> > >> will see this and ask them on my behalf.
> > >>
> > >> Alternatively, next week I may get some time to set up a server with an
> > >> open-source LLM to generate the vectors. It still won't be free, but it
> > >> would be 99% cheaper than paying the LLM companies if we can tolerate it
> > >> being slow.
> > >>
> > >>
> > >>
> > >> On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner 
> > >> <michael.wech...@wyona.com> wrote:
> > >>>
> > >>> Great, thank you!
> > >>>
> > >>> How much RAM, etc., did you run this test on?
> > >>>
> > >>> Do the vectors really have to be based on real data for testing the
> > >>> indexing?
> > >>> I understand that it matters if you want to test the quality of the
> > >>> search results, but for testing the scalability itself it should not
> > >>> actually matter, right?
> > >>>
> > >>> Thanks
> > >>>
> > >>> Michael
> > >>>
> > >>> On 07.04.23 at 01:19, Michael Sokolov wrote:
> > >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> > >>> > minutes with a single thread. I have some 256d vectors, but only about
> > >>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
> > >>> > vectors I can use for testing? If all else fails I can test with
> > >>> > noise, but that tends to lead to meaningless results.
> > >>> >
> > >>> > On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner
> > >>> > <michael.wech...@wyona.com> wrote:
> > >>> >>
> > >>> >>
> > >>> >> On 06.04.23 at 17:47, Robert Muir wrote:
> > >>> >>> Well, I'm asking that people actually try to test using such high
> > >>> >>> dimensions.
> > >>> >>> Based on my own experience, I consider it unusable. It seems other
> > >>> >>> folks may have run into trouble too. If the project committers can't
> > >>> >>> even really use vectors with such high dimension counts, then it's
> > >>> >>> not in an OK state for users, and we shouldn't bump the limit.
> > >>> >>>
> > >>> >>> I'm happy to discuss/compromise etc., but simply bumping the limit
> > >>> >>> without addressing the underlying usability/scalability is a real
> > >>> >>> no-go,
> > >>> >> I agree that this needs to be addressed.
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >>>    it is not really solving anything, nor is it giving users any
> > >>> >>> freedom or allowing them to do something they couldn't do before.
> > >>> >>> Because if it still doesn't work, it still doesn't work.
> > >>> >> I disagree, because it *does work* with "smaller" document sets.
> > >>> >>
> > >>> >> Currently we have to compile Lucene ourselves to avoid the exception
> > >>> >> when using a model with a vector dimension greater than 1024,
> > >>> >> which is of course possible, but not really convenient.
> > >>> >>
> > >>> >> As I wrote before, to resolve this discussion, I think we should test
> > >>> >> and address possible issues.
> > >>> >>
> > >>> >> I will try to stop discussing now :-) and instead try to understand
> > >>> >> the actual issues better. It would be great if others could join in
> > >>> >> on this!
> > >>> >>
> > >>> >> Thanks
> > >>> >>
> > >>> >> Michael
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >>> We all need to be on the same page, grounded in reality, not
> > >>> >>> fantasy: if we set a limit of 1024 or 2048, you should actually be
> > >>> >>> able to index vectors with that many dimensions, and it should
> > >>> >>> actually work and scale.
> > >>> >>>
> > >>> >>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
> > >>> >>> <a.benede...@sease.io> wrote:
> > >>> >>>> As I said earlier, a max limit limits usability.
> > >>> >>>> It's not forcing users with small vectors to pay the performance
> > >>> >>>> penalty of big vectors; it's literally preventing some users from
> > >>> >>>> using Lucene/Solr/Elasticsearch at all.
> > >>> >>>> As far as I know, the max limit is only used to raise an exception;
> > >>> >>>> it's not used to initialise or optimise data structures (please
> > >>> >>>> correct me if I'm wrong).
> > >>> >>>>
> > >>> >>>> Improving the algorithm's performance is a separate discussion.
> > >>> >>>> I don't see how the fact that indexing billions of vectors of
> > >>> >>>> whatever dimension is slow correlates with a usability parameter.
> > >>> >>>>
> > >>> >>>> What about potential users who need only a few high-dimensional
> > >>> >>>> vectors?
> > >>> >>>>
> > >>> >>>> As I said before, I am a big +1 for NOT just raising it blindly,
> > >>> >>>> but I believe we need to remove the limit or size it in a way that
> > >>> >>>> it's not a problem for either users or internal data structure
> > >>> >>>> optimizations, if any.
> > >>> >>>>
> > >>> >>>>
> > >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
> > >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to 
> > >>> >>>>> index
> > >>> >>>>> a few million vectors with 756 or 1024, which is allowed today.
> > >>> >>>>>
> > >>> >>>>> IMO, based on how painful it is, it seems the limit is already
> > >>> >>>>> too high. I realize that will sound controversial, but please at
> > >>> >>>>> least try it out!
> > >>> >>>>>
> > >>> >>>>> Voting +1 without at least doing this is really the
> > >>> >>>>> "weak/unscientifically minded" approach.
> > >>> >>>>>
> > >>> >>>>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
> > >>> >>>>> <michael.wech...@wyona.com> wrote:
> > >>> >>>>>> Thanks for your feedback!
> > >>> >>>>>>
> > >>> >>>>>> I agree that it should not crash.
> > >>> >>>>>>
> > >>> >>>>>> So far we have not experienced crashes ourselves, but we have
> > >>> >>>>>> not indexed millions of vectors.
> > >>> >>>>>>
> > >>> >>>>>> I will try to reproduce the crash; maybe this will help us move
> > >>> >>>>>> forward.
> > >>> >>>>>>
> > >>> >>>>>> Thanks
> > >>> >>>>>>
> > >>> >>>>>> Michael
> > >>> >>>>>>
> > >>> >>>>>> On 05.04.23 at 18:30, Dawid Weiss wrote:
> > >>> >>>>>>>> Can you describe your crash in more detail?
> > >>> >>>>>>> I can't. That experiment was a while ago and a quick test to 
> > >>> >>>>>>> see if I
> > >>> >>>>>>> could index rather large-ish USPTO (patent office) data as 
> > >>> >>>>>>> vectors.
> > >>> >>>>>>> Couldn't do it then.
> > >>> >>>>>>>
> > >>> >>>>>>>> How much RAM?
> > >>> >>>>>>> My indexing jobs run with rather smallish heaps to give space 
> > >>> >>>>>>> for I/O
> > >>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the 
> > >>> >>>>>>> problem.
> > >>> >>>>>>> I recall segment merging grew slower and slower and then simply
> > >>> >>>>>>> crashed. Lucene should work with low heap requirements, even if 
> > >>> >>>>>>> it
> > >>> >>>>>>> slows down. Throwing RAM at the indexing/segment-merging problem
> > >>> >>>>>>> is... I don't know, not elegant?
> > >>> >>>>>>>
> > >>> >>>>>>> Anyway, my main point was to remind folks about how Apache
> > >>> >>>>>>> works: code is merged in when there are no vetoes. If Rob (or
> > >>> >>>>>>> anybody else) remains unconvinced, he or she can block the
> > >>> >>>>>>> change. (I didn't invent those rules.)
> > >>> >>>>>>>
> > >>> >>>>>>> D.
> > >>> >>>>>>>
> > >>> >>
> > >>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Marcus Eagan
> > >>
> > >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
