I also want to add that we do impose some other limits on graph
construction to help ensure that HNSW-based vector fields remain
manageable: M is limited to <= 512, and the maximum segment size also
helps limit merge costs.
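
To make that concrete, here is a minimal sketch of where the M (maxConn)
knob is configured; class and package names assume the Lucene 9.5 codec,
and the values are illustrative rather than recommended settings, not
project test code:

    import org.apache.lucene.codecs.KnnVectorsFormat;
    import org.apache.lucene.codecs.lucene95.Lucene95Codec;
    import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
    import org.apache.lucene.index.IndexWriterConfig;

    public class HnswConfigSketch {
      // Returns an IndexWriterConfig whose vector fields use the given HNSW
      // parameters; maxConn (M) is validated against the hard cap (512 in 9.x).
      static IndexWriterConfig withHnswParams(int maxConn, int beamWidth) {
        IndexWriterConfig iwc = new IndexWriterConfig();
        iwc.setCodec(new Lucene95Codec() {
          @Override
          public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
            // larger M means a denser graph, more RAM at merge time, slower merges
            return new Lucene95HnswVectorsFormat(maxConn, beamWidth);
          }
        });
        return iwc;
      }
    }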

On Fri, Apr 7, 2023 at 7:45 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> Thanks Kent - I tried something similar to what you did, I think. Took
> a set of 256d vectors I had and concatenated them to make bigger ones,
> then shifted the dimensions to make more of them. Here are a few
> single-threaded indexing test runs. I ran all tests with M=16.
>
>
> 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> buffer size=1994)
> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>
> Increasing the vector dimension makes things take longer (scaling
> *linearly*) but doesn't lead to RAM issues. I think we could get to
> OOM while merging with a small heap and a large number of vectors, or
> by increasing M, but none of this has anything to do with vector
> dimensions. Also, if merge RAM usage is a problem I think we could
> address it by adding accounting to the merge process and simply not
> merging graphs when they exceed the buffer size (as we do with
> flushing).
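
For reference, the single-threaded runs above boil down to roughly the loop
below. This is an illustrative sketch with an assumed field name, index path,
and a placeholder vector source, not the actual test harness; only the RAM
buffer size mirrors the settings quoted above.

    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class VectorIndexingSketch {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig();
        iwc.setRAMBufferSizeMB(1994); // large flush buffer, as in the runs above
        try (IndexWriter writer =
            new IndexWriter(FSDirectory.open(Paths.get("/tmp/vector-index")), iwc)) {
          for (int i = 0; i < 8_000_000; i++) {
            Document doc = new Document();
            // nextVector() stands in for reading the real 1024d source vectors
            doc.add(new KnnFloatVectorField("vec", nextVector(1024),
                VectorSimilarityFunction.EUCLIDEAN));
            writer.addDocument(doc);
          }
        }
      }

      // placeholder: random data only demonstrates indexing cost, not recall
      static float[] nextVector(int dim) {
        float[] v = new float[dim];
        for (int i = 0; i < dim; i++) {
          v[i] = (float) Math.random();
        }
        return v;
      }
    }
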
>
> Robert, since you're the only on-the-record veto here, does this
> change your thinking at all, or if not could you share some test
> results that didn't go the way you expected? Maybe we can find some
> mitigation if we focus on a specific issue.
>
> On Fri, Apr 7, 2023 at 5:18 AM Kent Fitch <kent.fi...@gmail.com> wrote:
> >
> > Hi,
> > I have been testing Lucene with a custom vector similarity and loaded 192m
> > vectors of dim 512 bytes. (Yes, segment merges use a lot of Java memory...)
> >
> > As this was a performance test, the 192m vectors were derived by dithering
> > 47k original vectors in such a way as to allow realistic ANN evaluation of
> > HNSW.  The original 47k vectors were generated by ada-002 on source
> > newspaper article text.  After dithering, I used PQ to reduce their
> > dimensionality from 1536 floats to 512 bytes: 3 source dimensions map to a
> > 1-byte code, using 512 code tables, each learnt to reduce total encoding
> > error with Lloyd's algorithm (hence the need for the custom similarity).
> > BTW, HNSW retrieval was accurate and fast enough for the use case I was
> > investigating, as long as a machine with 128GB memory was available, as the
> > graph needs to be cached in memory for reasonable query rates.
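
To make the encoding concrete, below is a rough sketch of the asymmetric
distance such a custom PQ similarity would compute. The codebook layout (512
sub-spaces of 3 dims, 256 centroids each, learned offline with Lloyd's
algorithm) follows Kent's description, but all names here are hypothetical,
not his actual code.

    public class PqDistanceSketch {
      // codebooks[sub][code][component]: 512 x 256 x 3, learned offline
      private final float[][][] codebooks;

      PqDistanceSketch(float[][][] codebooks) {
        this.codebooks = codebooks;
      }

      // query stays as 1536 floats; the stored document is 512 byte codes
      float squaredDistance(float[] query, byte[] codes) {
        float total = 0f;
        for (int sub = 0; sub < codes.length; sub++) {
          float[] centroid = codebooks[sub][codes[sub] & 0xFF];
          for (int d = 0; d < 3; d++) {
            float diff = query[sub * 3 + d] - centroid[d];
            total += diff * diff;
          }
        }
        return total;
      }
    }
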
> >
> > Anyway, if you want them, you are welcome to those 47k vectors of 1536
> > floats, which can be readily dithered to generate very large and realistic
> > test vector sets.
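
A sketch of the kind of dithering Kent describes, assuming small Gaussian
perturbations; the noise scale is a made-up parameter and should be tuned so
nearest-neighbor structure stays realistic:

    import java.util.Random;

    public class DitherSketch {
      // produce a new vector near the base vector by adding small Gaussian noise
      static float[] dither(float[] base, float noiseScale, Random rng) {
        float[] out = new float[base.length];
        for (int i = 0; i < base.length; i++) {
          out[i] = base[i] + (float) (rng.nextGaussian() * noiseScale);
        }
        return out;
      }
    }
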
> >
> > Best regards,
> >
> > Kent Fitch
> >
> >
> > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wech...@wyona.com> 
> > wrote:
> >>
> >> you might want to use SentenceBERT to generate vectors
> >>
> >> https://sbert.net
> >>
> >> where, for example, the model "all-mpnet-base-v2" generates vectors with
> >> dimension 768.
> >>
> >> We have SentenceBERT running as a web service, which we could open for 
> >> these tests, but because of network latency it should be faster running 
> >> locally.
> >>
> >> HTH
> >>
> >> Michael
> >>
> >>
> >> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> >>
> >> I've started to look on the internet, and surely someone will come along,
> >> but the challenge, I suspect, is that these vectors are expensive to
> >> generate, so people have not gone all in on generating such large vectors
> >> for large datasets. They certainly have not made them easy to find. Here
> >> is the most promising, but it is probably too small:
> >> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> >>
> >> I'm still in and out of the office at the moment, but when I return, I
> >> can ask my employer if they will sponsor a 10 million document collection
> >> so that you can test with that. Or maybe someone from work will see this
> >> and ask them on my behalf.
> >>
> >> Alternatively, next week, I may get some time to set up a server with an 
> >> open source LLM to generate the vectors. It still won't be free, but it 
> >> would be 99% cheaper than paying the LLM companies if we can be slow.
> >>
> >>
> >>
> >> On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner <michael.wech...@wyona.com> 
> >> wrote:
> >>>
> >>> Great, thank you!
> >>>
> >>> How much RAM, etc., did you run this test on?
> >>>
> >>> Do the vectors really have to be based on real data for testing the
> >>> indexing?
> >>> I understand that it matters if you want to test the quality of the
> >>> search results, but for testing the scalability itself it should not
> >>> actually matter, right?
> >>>
> >>> Thanks
> >>>
> >>> Michael
> >>>
> >>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> >>> > minutes with a single thread. I have some 256d vectors, but only about
> >>> > 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
> >>> > vectors I can use for testing? If all else fails I can test with
> >>> > noise, but that tends to lead to meaningless results
> >>> >
> >>> > On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner
> >>> > <michael.wech...@wyona.com> wrote:
> >>> >>
> >>> >>
> >>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> >>> >>> Well, I'm asking that people actually try to test using such high
> >>> >>> dimensions. Based on my own experience, I consider it unusable. It
> >>> >>> seems other folks may have run into trouble too. If the project
> >>> >>> committers can't even really use vectors with such high dimension
> >>> >>> counts, then it's not in an OK state for users, and we shouldn't bump
> >>> >>> the limit.
> >>> >>>
> >>> >>> I'm happy to discuss/compromise etc, but simply bumping the limit
> >>> >>> without addressing the underlying usability/scalability is a real
> >>> >>> no-go,
> >>> >> I agree that this needs to be addressed.
> >>> >>
> >>> >>
> >>> >>
> >>> >>>    it is not really solving anything, nor is it giving users any
> >>> >>> freedom or allowing them to do something they couldn't do before.
> >>> >>> Because if it still doesn't work, it still doesn't work.
> >>> >> I disagree, because it *does work* with "smaller" document sets.
> >>> >>
> >>> >> Currently we have to compile Lucene ourselves to not get the exception
> >>> >> when using a model with vector dimension greater than 1024,
> >>> >> which is of course possible, but not really convenient.
> >>> >>
> >>> >> As I wrote before, to resolve this discussion, I think we should test
> >>> >> and address possible issues.
> >>> >>
> >>> >> I will try to stop discussing now :-) and instead try to understand
> >>> >> better the actual issues. Would be great if others could join on this!
> >>> >>
> >>> >> Thanks
> >>> >>
> >>> >> Michael
> >>> >>
> >>> >>
> >>> >>
> >>> >>> We all need to be on the same page, grounded in reality, not fantasy,
> >>> >>> where if we set a limit of 1024 or 2048, that you can actually index
> >>> >>> vectors with that many dimensions and it actually works and scales.
> >>> >>>
> >>> >>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
> >>> >>> <a.benede...@sease.io> wrote:
> >>> >>>> As I said earlier, a max limit limits usability.
> >>> >>>> It's not forcing users with small vectors to pay the performance
> >>> >>>> penalty of big vectors; it's literally preventing some users from
> >>> >>>> using Lucene/Solr/Elasticsearch at all.
> >>> >>>> As far as I know, the max limit is only used to raise an exception;
> >>> >>>> it's not used to initialise or optimise data structures (please
> >>> >>>> correct me if I'm wrong).
> >>> >>>>
> >>> >>>> Improving the algorithm performance is a separate discussion.
> >>> >>>> I don't see how the fact that indexing billions of vectors of
> >>> >>>> whatever dimension is slow correlates with a usability parameter.
> >>> >>>>
> >>> >>>> What about potential users who need only a few high-dimensional vectors?
> >>> >>>>
> >>> >>>> As I said before, I am a big +1 for NOT just raising it blindly, but
> >>> >>>> I believe we need to remove the limit or size it in a way that it's
> >>> >>>> not a problem for either users or internal data structure
> >>> >>>> optimizations, if any.
> >>> >>>>
> >>> >>>>
> >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
> >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to
> >>> >>>>> index a few million vectors with 756 or 1024, which is allowed today.
> >>> >>>>>
> >>> >>>>> IMO, based on how painful it is, it seems the limit is already too
> >>> >>>>> high. I realize that will sound controversial, but please at least
> >>> >>>>> try it out!
> >>> >>>>>
> >>> >>>>> Voting +1 without at least doing this is really the
> >>> >>>>> "weak/unscientifically minded" approach.
> >>> >>>>>
> >>> >>>>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
> >>> >>>>> <michael.wech...@wyona.com> wrote:
> >>> >>>>>> Thanks for your feedback!
> >>> >>>>>>
> >>> >>>>>> I agree, that it should not crash.
> >>> >>>>>>
> >>> >>>>>> So far we did not experience crashes ourselves, but we did not 
> >>> >>>>>> index
> >>> >>>>>> millions of vectors.
> >>> >>>>>>
> >>> >>>>>> I will try to reproduce the crash, maybe this will help us to move 
> >>> >>>>>> forward.
> >>> >>>>>>
> >>> >>>>>> Thanks
> >>> >>>>>>
> >>> >>>>>> Michael
> >>> >>>>>>
> >>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> >>> >>>>>>>> Can you describe your crash in more detail?
> >>> >>>>>>> I can't. That experiment was a while ago and a quick test to see 
> >>> >>>>>>> if I
> >>> >>>>>>> could index rather large-ish USPTO (patent office) data as 
> >>> >>>>>>> vectors.
> >>> >>>>>>> Couldn't do it then.
> >>> >>>>>>>
> >>> >>>>>>>> How much RAM?
> >>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for 
> >>> >>>>>>> I/O
> >>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been the 
> >>> >>>>>>> problem.
> >>> >>>>>>> I recall segment merging grew slower and slower and then simply
> >>> >>>>>>> crashed. Lucene should work with low heap requirements, even if it
> >>> >>>>>>> slows down. Throwing RAM at the indexing/segment-merging problem
> >>> >>>>>>> is... I don't know - not elegant?
> >>> >>>>>>>
> >>> >>>>>>> Anyway. My main point was to remind folks about how Apache works -
> >>> >>>>>>> code is merged in when there are no vetoes. If Rob (or anybody 
> >>> >>>>>>> else)
> >>> >>>>>>> remains unconvinced, he or she can block the change. (I didn't 
> >>> >>>>>>> invent
> >>> >>>>>>> those rules).
> >>> >>>>>>>
> >>> >>>>>>> D.
> >>> >>>>>>>
> >>
> >>
> >> --
> >> Marcus Eagan
> >>
> >>
