I also want to add that we impose some other limits on graph construction to help ensure that HNSW-based vector fields remain manageable: M is limited to <= 512, and the maximum segment size also helps limit merge costs.
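As a back-of-envelope illustration of why capping M keeps graphs manageable, here is a rough per-vector memory estimate. The layout constants (2*M neighbors on the bottom layer, 4-byte neighbor ordinals) are common HNSW conventions, not Lucene's exact on-disk format:

```python
# Back-of-envelope estimate of per-vector storage for an HNSW graph.
# Illustrative only: the constants are assumptions, not Lucene's exact layout.

def hnsw_bytes_per_vector(dims: int, m: int, bytes_per_component: int = 4) -> int:
    """Rough bytes per vector: the vector data itself plus graph links.

    Assumes ~2*M neighbors on the bottom layer (a common HNSW convention)
    and 4 bytes per neighbor ordinal; upper layers add little on average.
    """
    vector_bytes = dims * bytes_per_component
    link_bytes = m * 2 * 4
    return vector_bytes + link_bytes

# Link storage grows linearly with M, which is why a cap like M <= 512 matters.
for m in (16, 64, 512):
    per_vec = hnsw_bytes_per_vector(dims=1024, m=m)
    total_gib = per_vec * 8_000_000 / 1024**3  # 8M vectors, as in the tests below
    print(f"M={m:3d}: {per_vec} bytes/vector, ~{total_gib:.1f} GiB for 8M 1024d vectors")
```

At M=16 the links are a small fraction of a 1024d float vector; at M=512 they dominate it.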
On Fri, Apr 7, 2023 at 7:45 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> Thanks Kent - I tried something similar to what you did, I think. I took
> a set of 256d vectors I had and concatenated them to make bigger ones,
> then shifted the dimensions to make more of them. Here are a few
> single-threaded indexing test runs. I ran all tests with M=16.
>
> 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter buffer size=1994)
> 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> 4M 2048d float vectors indexed in 1h44m (4G heap, IW buffer size=1994)
>
> Increasing the vector dimension makes things take longer (scaling
> *linearly*) but doesn't lead to RAM issues. I think we could get to
> OOM while merging with a small heap and a large number of vectors, or
> by increasing M, but none of this has anything to do with vector
> dimensions. Also, if merge RAM usage is a problem, I think we could
> address it by adding accounting to the merge process and simply not
> merging graphs when they exceed the buffer size (as we do with
> flushing).
>
> Robert, since you're the only on-the-record veto here, does this
> change your thinking at all? If not, could you share some test
> results that didn't go the way you expected? Maybe we can find some
> mitigation if we focus on a specific issue.
>
> On Fri, Apr 7, 2023 at 5:18 AM Kent Fitch <kent.fi...@gmail.com> wrote:
> >
> > Hi,
> > I have been testing Lucene with a custom vector similarity and loaded
> > 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of Java
> > memory...)
> >
> > As this was a performance test, the 192m vectors were derived by
> > dithering 47k original vectors in such a way as to allow realistic ANN
> > evaluation of HNSW. The original 47k vectors were generated by
> > ada-002 on source newspaper article text.
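The dithering trick Kent describes - expanding a small set of real embeddings into a large, realistic test set by perturbing each one - can be sketched roughly like this (an illustrative stand-in, not his actual code; the Gaussian noise model and scale are assumptions):

```python
import random

def dither(vector, copies, scale=0.01, seed=0):
    """Expand one embedding into `copies` perturbed variants by adding
    small Gaussian noise to each component.

    The noise scale is an assumption: it should be small enough that the
    variants remain plausible neighbors of the original, so ANN recall
    measurements on the expanded set stay meaningful.
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, scale) for x in vector] for _ in range(copies)]

# e.g. 47k originals x ~4,100 copies each gives roughly 192M test vectors
base = [0.1, -0.2, 0.3]            # stand-in for a real ada-002 embedding
variants = dither(base, copies=3)
```

In practice one would stream the variants straight into the indexer rather than materializing 192M vectors in memory.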
> > After dithering, I used PQ to reduce their dimensionality from 1536
> > floats to 512 bytes - 3 source dimensions to a 1-byte code, with 512
> > code tables, each learnt to reduce total encoding error using Lloyd's
> > algorithm (hence the need for the custom similarity). BTW, HNSW
> > retrieval was accurate and fast enough for the use case I was
> > investigating, as long as a machine with 128GB memory was available,
> > as the graph needs to be cached in memory for reasonable query rates.
> >
> > Anyway, if you want them, you are welcome to those 47k vectors of 1536
> > floats, which can be readily dithered to generate very large and
> > realistic test vector sets.
> >
> > Best regards,
> >
> > Kent Fitch
> >
> > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner <michael.wech...@wyona.com> wrote:
> >>
> >> you might want to use SentenceBERT to generate vectors
> >>
> >> https://sbert.net
> >>
> >> whereas for example the model "all-mpnet-base-v2" generates vectors
> >> with dimension 768
> >>
> >> We have SentenceBERT running as a web service, which we could open up
> >> for these tests, but because of network latency it should be faster
> >> to run it locally.
> >>
> >> HTH
> >>
> >> Michael
> >>
> >> On 07.04.23 at 10:11, Marcus Eagan wrote:
> >>
> >> I've started to look on the internet, and surely someone will come
> >> forward, but I suspect the challenge is that these vectors are
> >> expensive to generate, so people have not gone all in on generating
> >> such large vectors for large datasets. They certainly have not made
> >> them easy to find. Here is the most promising dataset, but it is
> >> probably too small:
> >> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> >>
> >> I'm still in and out of the office at the moment, but when I return,
> >> I can ask my employer if they will sponsor a 10 million document
> >> collection so that you can test with that. Or maybe someone from work
> >> will see this and ask them on my behalf.
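Kent's PQ scheme above (1536 floats reduced to 512 one-byte codes, each covering 3 source dimensions, with per-subspace codebooks trained by Lloyd's algorithm) could be sketched roughly as follows. This is a simplified pure-Python illustration, not his code; a real pipeline would use 256-centroid codebooks and an optimized PQ library:

```python
import random

def train_codebook(subvectors, k=256, iters=5, seed=0):
    """Train one codebook for a 3d subspace via Lloyd's algorithm (k-means).

    Each of the 512 subspaces gets its own codebook; k=256 makes every
    code fit in one byte. `iters` is kept tiny here for illustration.
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(subvectors, min(k, len(subvectors)))]
    for _ in range(iters):
        # assignment step: bucket each subvector under its nearest centroid
        buckets = [[] for _ in centroids]
        for v in subvectors:
            i = min(range(len(centroids)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            buckets[i].append(v)
        # update step: move each centroid to the mean of its bucket
        for i, b in enumerate(buckets):
            if b:
                centroids[i] = [sum(col) / len(b) for col in zip(*b)]
    return centroids

def pq_encode(vector, codebooks, sub_dim=3):
    """Encode a vector as one byte per subspace (512 bytes for 1536 dims)."""
    codes = []
    for s, cb in enumerate(codebooks):
        v = vector[s * sub_dim:(s + 1) * sub_dim]
        codes.append(min(range(len(cb)),
                         key=lambda c: sum((a - b) ** 2 for a, b in zip(v, cb[c]))))
    return bytes(codes)
```

Distances are then computed against the codebook centroids rather than raw floats, which is why a custom similarity is needed on the Lucene side.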
> >> Alternatively, next week I may get some time to set up a server with
> >> an open source LLM to generate the vectors. It still won't be free,
> >> but it would be 99% cheaper than paying the LLM companies if we can
> >> afford to be slow.
> >>
> >> On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner <michael.wech...@wyona.com> wrote:
> >>>
> >>> Great, thank you!
> >>>
> >>> How much RAM, etc. did you run this test on?
> >>>
> >>> Do the vectors really have to be based on real data for testing the
> >>> indexing? I understand that it matters if you want to test the
> >>> quality of the search results, but for testing the scalability
> >>> itself it should not actually matter, right?
> >>>
> >>> Thanks
> >>>
> >>> Michael
> >>>
> >>> On 07.04.23 at 01:19, Michael Sokolov wrote:
> >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
> >>> > minutes with a single thread. I have some 256d vectors, but only
> >>> > about 2M of them. Can anybody point me to a large set (say 8M+) of
> >>> > 1024+ dim vectors I can use for testing? If all else fails I can
> >>> > test with noise, but that tends to lead to meaningless results.
> >>> >
> >>> > On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner <michael.wech...@wyona.com> wrote:
> >>> >>
> >>> >> On 06.04.23 at 17:47, Robert Muir wrote:
> >>> >>> Well, I'm asking people to actually try testing with such high
> >>> >>> dimensions. Based on my own experience, I consider it unusable.
> >>> >>> It seems other folks may have run into trouble too. If the
> >>> >>> project committers can't even really use vectors with such high
> >>> >>> dimension counts, then it's not in an OK state for users, and we
> >>> >>> shouldn't bump the limit.
> >>> >>>
> >>> >>> I'm happy to discuss/compromise etc., but simply bumping the
> >>> >>> limit without addressing the underlying usability/scalability is
> >>> >>> a real no-go,
> >>> >> I agree that this needs to be addressed
> >>> >>
> >>> >>> it is not really solving anything, nor is it giving users any
> >>> >>> freedom or allowing them to do something they couldn't do
> >>> >>> before. Because if it still doesn't work, it still doesn't work.
> >>> >> I disagree, because it *does work* with "smaller" document sets.
> >>> >>
> >>> >> Currently we have to compile Lucene ourselves to avoid the
> >>> >> exception when using a model with a vector dimension greater than
> >>> >> 1024, which is of course possible, but not really convenient.
> >>> >>
> >>> >> As I wrote before, to resolve this discussion, I think we should
> >>> >> test and address possible issues.
> >>> >>
> >>> >> I will try to stop discussing now :-) and instead try to better
> >>> >> understand the actual issues. It would be great if others could
> >>> >> join in on this!
> >>> >>
> >>> >> Thanks
> >>> >>
> >>> >> Michael
> >>> >>
> >>> >>> We all need to be on the same page, grounded in reality, not
> >>> >>> fantasy, where if we set a limit of 1024 or 2048, you can
> >>> >>> actually index vectors with that many dimensions and it actually
> >>> >>> works and scales.
> >>> >>>
> >>> >>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
> >>> >>>> As I said earlier, a max limit limits usability.
> >>> >>>> It's not forcing users with small vectors to pay the
> >>> >>>> performance penalty of big vectors; it's literally preventing
> >>> >>>> some users from using Lucene/Solr/Elasticsearch at all.
> >>> >>>> As far as I know, the max limit is only used to raise an
> >>> >>>> exception; it's not used to initialise or optimise data
> >>> >>>> structures (please correct me if I'm wrong).
> >>> >>>>
> >>> >>>> Improving the algorithm's performance is a separate discussion.
> >>> >>>> I don't see how the fact that indexing billions of vectors of
> >>> >>>> whatever dimension is slow correlates with a usability
> >>> >>>> parameter.
> >>> >>>>
> >>> >>>> What about potential users that need a few high-dimensional
> >>> >>>> vectors?
> >>> >>>>
> >>> >>>> As I said before, I am a big +1 for NOT just raising it
> >>> >>>> blindly, but I believe we need to remove the limit or size it
> >>> >>>> in a way that it's not a problem for either users or internal
> >>> >>>> data structure optimizations, if any.
> >>> >>>>
> >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir <rcm...@gmail.com> wrote:
> >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try
> >>> >>>>> to index a few million vectors with 756 or 1024 dimensions,
> >>> >>>>> which is allowed today.
> >>> >>>>>
> >>> >>>>> IMO based on how painful it is, it seems the limit is already
> >>> >>>>> too high. I realize that will sound controversial, but please
> >>> >>>>> at least try it out!
> >>> >>>>>
> >>> >>>>> Voting +1 without at least doing this is really the
> >>> >>>>> "weak/unscientifically minded" approach.
> >>> >>>>>
> >>> >>>>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner <michael.wech...@wyona.com> wrote:
> >>> >>>>>> Thanks for your feedback!
> >>> >>>>>>
> >>> >>>>>> I agree that it should not crash.
> >>> >>>>>>
> >>> >>>>>> So far we did not experience crashes ourselves, but we did
> >>> >>>>>> not index millions of vectors.
> >>> >>>>>>
> >>> >>>>>> I will try to reproduce the crash; maybe this will help us
> >>> >>>>>> move forward.
> >>> >>>>>>
> >>> >>>>>> Thanks
> >>> >>>>>>
> >>> >>>>>> Michael
> >>> >>>>>>
> >>> >>>>>> On 05.04.23 at 18:30, Dawid Weiss wrote:
> >>> >>>>>>>> Can you describe your crash in more detail?
> >>> >>>>>>> I can't.
> >>> >>>>>>> That experiment was a while ago and was a quick test to see
> >>> >>>>>>> if I could index rather large-ish USPTO (patent office) data
> >>> >>>>>>> as vectors. I couldn't do it then.
> >>> >>>>>>>
> >>> >>>>>>>> How much RAM?
> >>> >>>>>>> My indexing jobs run with rather smallish heaps to leave
> >>> >>>>>>> space for I/O buffers - think 4-8GB at most. So yes, that
> >>> >>>>>>> could have been the problem. I recall segment merging grew
> >>> >>>>>>> slower and slower and then simply crashed. Lucene should
> >>> >>>>>>> work with low heap requirements, even if it slows down.
> >>> >>>>>>> Throwing RAM at the indexing/segment-merging problem is...
> >>> >>>>>>> I don't know - not elegant?
> >>> >>>>>>>
> >>> >>>>>>> Anyway, my main point was to remind folks about how Apache
> >>> >>>>>>> works - code is merged in when there are no vetoes. If Rob
> >>> >>>>>>> (or anybody else) remains unconvinced, he or she can block
> >>> >>>>>>> the change. (I didn't invent those rules.)
> >>> >>>>>>>
> >>> >>>>>>> D.
> >> --
> >> Marcus Eagan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org