One more data point: 32M 100-dim (fp32) vectors indexed in 1h20m (M=16, IW buffer size=1994MB, heap=4GB).
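For reference, here is a minimal sketch of the kind of indexing setup behind numbers like these. It is not the actual harness used for the runs above: it assumes Lucene 9.5/9.6 (the Lucene95Codec/Lucene95HnswVectorsFormat class names, the "vec" field name, the /tmp path, and the random test vectors are all illustrative), and it just shows where M=16 and the 1994MB RAM buffer plug in.

import java.nio.file.Paths;
import java.util.Random;

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class HnswIndexBench {
  public static void main(String[] args) throws Exception {
    int dim = 100;            // vector dimension for this run
    int numDocs = 32_000_000; // 32M vectors, as in the data point above

    IndexWriterConfig iwc = new IndexWriterConfig();
    // flush by RAM usage; 1994 MB matches the "IW buffer size=1994" setting above
    iwc.setRAMBufferSizeMB(1994);
    // M=16 (max connections per graph node); 100 is Lucene's default beam width
    iwc.setCodec(new Lucene95Codec() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return new Lucene95HnswVectorsFormat(16, 100);
      }
    });

    Random random = new Random(42);
    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/hnsw-bench"));
         IndexWriter writer = new IndexWriter(dir, iwc)) {
      for (int i = 0; i < numDocs; i++) {
        // random vectors stand in for real embeddings here; they index at the same
        // speed, though (as noted later in the thread) recall measurements on noise
        // are not meaningful
        float[] vector = new float[dim];
        for (int j = 0; j < dim; j++) {
          vector[j] = random.nextFloat();
        }
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("vec", vector, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
      }
    }
  }
}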
On Fri, Apr 7, 2023 at 8:52 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> I also want to add that we do impose some other limits on graph construction to help ensure that HNSW-based vector fields remain manageable; M is limited to <= 512, and maximum segment size also helps limit merge costs.
>
> On Fri, Apr 7, 2023 at 7:45 AM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > Thanks Kent - I tried something similar to what you did, I think. I took a set of 256d vectors I had and concatenated them to make bigger ones, then shifted the dimensions to make more of them. Here are a few single-threaded indexing test runs. I ran all tests with M=16.
> >
> > 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter buffer size=1994)
> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > 4M 2048d float vectors indexed in 1h44m (4G heap, IW buffer size=1994)
> >
> > Increasing the vector dimension makes things take longer (scaling *linearly*) but doesn't lead to RAM issues. I think we could get to OOM while merging with a small heap and a large number of vectors, or by increasing M, but none of this has anything to do with vector dimensions. Also, if merge RAM usage is a problem, I think we could address it by adding accounting to the merge process and simply not merging graphs when they exceed the buffer size (as we do with flushing).
> >
> > Robert, since you're the only on-the-record veto here, does this change your thinking at all? If not, could you share some test results that didn't go the way you expected? Maybe we can find some mitigation if we focus on a specific issue.
> >
> > On Fri, Apr 7, 2023 at 5:18 AM Kent Fitch <kent.fi...@gmail.com> wrote:
> > >
> > > Hi,
> > > I have been testing Lucene with a custom vector similarity and loaded 192M vectors of dim 512 bytes. (Yes, segment merges use a lot of Java memory.)
> > >
> > > As this was a performance test, the 192M vectors were derived by dithering 47k original vectors in such a way as to allow realistic ANN evaluation of HNSW. The original 47k vectors were generated by ada-002 on source newspaper article text. After dithering, I used PQ to reduce their dimensionality from 1536 floats to 512 bytes - 3 source dimensions to a 1-byte code, with 512 code tables, each learnt to reduce total encoding error using Lloyd's algorithm (hence the need for the custom similarity). BTW, HNSW retrieval was accurate and fast enough for the use case I was investigating, as long as a machine with 128GB memory was available, because the graph needs to be cached in memory for reasonable query rates.
> > >
> > > Anyway, if you want them, you are welcome to those 47k vectors of 1536 floats, which can be readily dithered to generate very large and realistic test vector sets.
> > >
> > > Best regards,
> > >
> > > Kent Fitch
> > >
> > > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <michael.wech...@wyona.com> wrote:
> > >>
> > >> you might want to use SentenceBERT to generate vectors
> > >>
> > >> https://sbert.net
> > >>
> > >> whereas for example the model "all-mpnet-base-v2" generates vectors with dimension 768
> > >>
> > >> We have SentenceBERT running as a web service, which we could open up for these tests, but because of network latency it should be faster running locally.
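As an aside on the test-data methodology Michael and Kent describe above (concatenating/shifting small seed vectors, and dithering a small seed set into a much larger one): here is a minimal sketch of that idea. It is not Kent's or Michael's actual code; the noise scale and the rotate-by-offset step are illustrative choices.

import java.util.Random;

public class TestVectorExpander {

  private final Random random = new Random(42);

  // Concatenate two seed vectors to synthesize a higher-dimensional one (e.g. 2 x 256d -> 512d).
  float[] concatenate(float[] a, float[] b) {
    float[] out = new float[a.length + b.length];
    System.arraycopy(a, 0, out, 0, a.length);
    System.arraycopy(b, 0, out, a.length, b.length);
    return out;
  }

  // Rotate dimensions by a fixed offset to derive "new" vectors from the same seed.
  float[] shift(float[] v, int offset) {
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = v[(i + offset) % v.length];
    }
    return out;
  }

  // Add small Gaussian noise so each copy is near, but not identical to, its seed,
  // which keeps neighborhoods realistic enough for ANN/HNSW evaluation.
  float[] dither(float[] v, float scale) {
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = v[i] + (float) random.nextGaussian() * scale;
    }
    return out;
  }
}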
> > >>
> > >> HTH
> > >>
> > >> Michael
> > >>
> > >> On 07.04.23 at 10:11, Marcus Eagan wrote:
> > >>
> > >> I've started to look on the internet, and surely someone will come along, but the challenge I suspect is that these vectors are expensive to generate, so people have not gone all in on generating such large vectors for large datasets. They certainly have not made them easy to find. Here is the most promising, but it is probably too small: https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> > >>
> > >> I'm still in and out of the office at the moment, but when I return, I can ask my employer if they will sponsor a 10 million document collection so that you can test with that. Or maybe someone from work will see this and ask them on my behalf.
> > >>
> > >> Alternatively, next week I may get some time to set up a server with an open source LLM to generate the vectors. It still won't be free, but it would be 99% cheaper than paying the LLM companies if we can be slow.
> > >>
> > >> On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner <michael.wech...@wyona.com> wrote:
> > >>>
> > >>> Great, thank you!
> > >>>
> > >>> How much RAM etc. did you run this test on?
> > >>>
> > >>> Do the vectors really have to be based on real data for testing the indexing? I understand that if you want to test the quality of the search results it does matter, but for testing the scalability itself it should not actually matter, right?
> > >>>
> > >>> Thanks
> > >>>
> > >>> Michael
> > >>>
> > >>> On 07.04.23 at 01:19, Michael Sokolov wrote:
> > >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20 minutes with a single thread. I have some 256d vectors, but only about 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim vectors I can use for testing? If all else fails I can test with noise, but that tends to lead to meaningless results.
> > >>> >
> > >>> > On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner <michael.wech...@wyona.com> wrote:
> > >>> >>
> > >>> >> On 06.04.23 at 17:47, Robert Muir wrote:
> > >>> >>> Well, I'm asking that people actually try to test using such high dimensions. Based on my own experience, I consider it unusable. It seems other folks may have run into trouble too. If the project committers can't even really use vectors with such high dimension counts, then it's not in an OK state for users, and we shouldn't bump the limit.
> > >>> >>>
> > >>> >>> I'm happy to discuss/compromise etc., but simply bumping the limit without addressing the underlying usability/scalability is a real no-go,
> > >>> >> I agree that this needs to be addressed
> > >>> >>
> > >>> >>> it is not really solving anything, nor is it giving users any freedom or allowing them to do something they couldn't do before. Because if it still doesn't work, it still doesn't work.
> > >>> >> I disagree, because it *does work* with "smaller" document sets.
> > >>> >>
> > >>> >> Currently we have to compile Lucene ourselves to not get the exception when using a model with vector dimension greater than 1024, which is of course possible, but not really convenient.
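To make the exception Michael mentions concrete, here is a minimal sketch of how the 1024-dimension cap shows up in application code. It assumes the Lucene 9.5/9.6 releases under discussion, where the cap is checked when the vector field is created; the exact class holding the constant and the wording of the message have moved between releases, so treat the details as illustrative.

import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

public class DimensionLimitDemo {
  public static void main(String[] args) {
    // 1536 dims, e.g. OpenAI ada-002 output; above the 1024 limit discussed in this thread
    float[] vector = new float[1536];
    try {
      // In the 9.x releases discussed here this throws IllegalArgumentException at field
      // creation time; the limit is only a validation check, it does not size any structures.
      new KnnFloatVectorField("vec", vector, VectorSimilarityFunction.COSINE);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}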
> > >>> >>
> > >>> >> As I wrote before, to resolve this discussion, I think we should test and address possible issues.
> > >>> >>
> > >>> >> I will try to stop discussing now :-) and instead try to understand the actual issues better. It would be great if others could join in on this!
> > >>> >>
> > >>> >> Thanks
> > >>> >>
> > >>> >> Michael
> > >>> >>
> > >>> >>> We all need to be on the same page, grounded in reality, not fantasy, where if we set a limit of 1024 or 2048, you can actually index vectors with that many dimensions and it actually works and scales.
> > >>> >>>
> > >>> >>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
> > >>> >>>> As I said earlier, a max limit limits usability. It's not forcing users with small vectors to pay the performance penalty of big vectors; it's literally preventing some users from using Lucene/Solr/Elasticsearch at all. As far as I know, the max limit is used to raise an exception; it's not used to initialise or optimise data structures (please correct me if I'm wrong).
> > >>> >>>>
> > >>> >>>> Improving the algorithm performance is a separate discussion. I don't see how the fact that indexing billions of vectors of whatever dimension is slow correlates with a usability parameter.
> > >>> >>>>
> > >>> >>>> What about potential users that need just a few high-dimensional vectors?
> > >>> >>>>
> > >>> >>>> As I said before, I am a big +1 for NOT just raising it blindly, but I believe we need to remove the limit or size it in a way that it's not a problem for either users or internal data structure optimizations, if any.
> > >>> >>>>
> > >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
> > >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try to index a few million vectors with 756 or 1024 dimensions, which is allowed today.
> > >>> >>>>>
> > >>> >>>>> IMO, based on how painful it is, it seems the limit is already too high. I realize that will sound controversial, but please at least try it out!
> > >>> >>>>>
> > >>> >>>>> Voting +1 without at least doing this is really the "weak/unscientifically minded" approach.
> > >>> >>>>>
> > >>> >>>>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner <michael.wech...@wyona.com> wrote:
> > >>> >>>>>> Thanks for your feedback!
> > >>> >>>>>>
> > >>> >>>>>> I agree that it should not crash.
> > >>> >>>>>>
> > >>> >>>>>> So far we did not experience crashes ourselves, but we did not index millions of vectors.
> > >>> >>>>>>
> > >>> >>>>>> I will try to reproduce the crash; maybe this will help us to move forward.
> > >>> >>>>>>
> > >>> >>>>>> Thanks
> > >>> >>>>>>
> > >>> >>>>>> Michael
> > >>> >>>>>>
> > >>> >>>>>> On 05.04.23 at 18:30, Dawid Weiss wrote:
> > >>> >>>>>>>> Can you describe your crash in more detail?
> > >>> >>>>>>> I can't. That experiment was a while ago and was a quick test to see if I could index rather large-ish USPTO (patent office) data as vectors. Couldn't do it then.
> > >>> >>>>>>>
> > >>> >>>>>>>> How much RAM?
> > >>> >>>>>>> My indexing jobs run with rather smallish heaps to give space for I/O buffers. Think 4-8GB at most. So yes, it could have been the problem. I recall segment merging grew slower and slower and then simply crashed. Lucene should work with low heap requirements, even if it slows down. Throwing RAM at the indexing/segment merging problem is... I don't know - not elegant?
> > >>> >>>>>>>
> > >>> >>>>>>> Anyway. My main point was to remind folks about how Apache works - code is merged in when there are no vetoes. If Rob (or anybody else) remains unconvinced, he or she can block the change. (I didn't invent those rules.)
> > >>> >>>>>>>
> > >>> >>>>>>> D.
> > >>
> > >> --
> > >> Marcus Eagan
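Dawid's low-heap merge concern ties back to Michael's earlier note that maximum segment size already helps bound merge cost. Here is a minimal sketch of that existing knob: capping the merged segment size with TieredMergePolicy also caps how large an HNSW graph a single merge has to rebuild at once. The 2048 MB value is an illustrative choice, not a recommendation, and this is not presented as a fix for the OOM Dawid saw.

import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class BoundedMergeConfig {
  static IndexWriterConfig boundedMergeConfig() {
    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    // Default is 5 GB; a smaller cap trades more segments (and slower search)
    // for smaller, cheaper merges that fit in a modest heap.
    mergePolicy.setMaxMergedSegmentMB(2048);

    IndexWriterConfig iwc = new IndexWriterConfig();
    iwc.setMergePolicy(mergePolicy);
    iwc.setRAMBufferSizeMB(1994); // flush-by-RAM setting from the runs above
    return iwc;
  }
}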