I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20 minutes with a single thread. I have some 256d vectors, but only about 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim vectors I can use for testing? If all else fails I can test with noise, but that tends to lead to meaningless results.
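One workaround, absent a real corpus, is synthetic data with some cluster structure rather than uniform noise, so nearest-neighbor results are at least measurable. A minimal sketch (pure Python; the cluster count, spread, and function name are arbitrary choices for illustration, not anything from this thread):

```python
import random

def make_clustered_vectors(n, dim, n_clusters=64, spread=0.05, seed=42):
    """Generate n dim-dimensional float vectors drawn around random
    cluster centers, so neighbors are meaningful (unlike uniform noise)."""
    rng = random.Random(seed)
    centers = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
               for _ in range(n_clusters)]
    vectors = []
    for _ in range(n):
        center = rng.choice(centers)
        # Perturb the chosen center with small Gaussian noise.
        vectors.append([x + rng.gauss(0.0, spread) for x in center])
    return vectors

vecs = make_clustered_vectors(1000, 128)
```

Scaled up, something like this at least exercises the index with data where recall against brute-force ground truth is well-defined, even if it won't reproduce the distribution of real embeddings.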
On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>
> On 06.04.23 at 17:47, Robert Muir wrote:
> > Well, I'm asking people to actually try to test using such high dimensions.
> > Based on my own experience, I consider it unusable. It seems other
> > folks may have run into trouble too. If the project committers can't
> > even really use vectors with such high dimension counts, then it's not
> > in an OK state for users, and we shouldn't bump the limit.
> >
> > I'm happy to discuss/compromise etc, but simply bumping the limit
> > without addressing the underlying usability/scalability is a real
> > no-go,
>
> I agree that this needs to be addressed
>
> > it is not really solving anything, nor is it giving users any
> > freedom or allowing them to do something they couldn't do before.
> > Because if it still doesn't work, it still doesn't work.
>
> I disagree, because it *does work* with "smaller" document sets.
>
> Currently we have to compile Lucene ourselves to not get the exception
> when using a model with a vector dimension greater than 1024,
> which is of course possible, but not really convenient.
>
> As I wrote before, to resolve this discussion, I think we should test
> and address possible issues.
>
> I will try to stop discussing now :-) and instead try to understand
> the actual issues better. It would be great if others could join in on this!
>
> Thanks
>
> Michael
>
> > We all need to be on the same page, grounded in reality, not fantasy,
> > where if we set a limit of 1024 or 2048, you can actually index
> > vectors with that many dimensions and it actually works and scales.
> >
> > On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
> > <a.benede...@sease.io> wrote:
> >> As I said earlier, a max limit limits usability.
> >> It's not forcing users with small vectors to pay the performance penalty
> >> of big vectors, it's literally preventing some users from using
> >> Lucene/Solr/Elasticsearch at all.
> >> As far as I know, the max limit is used to raise an exception; it's not
> >> used to initialise or optimise data structures (please correct me if I'm
> >> wrong).
> >>
> >> Improving the algorithm's performance is a separate discussion.
> >> I don't see how the fact that indexing billions of vectors of whatever
> >> dimension is slow relates to a usability parameter.
> >>
> >> What about potential users who need just a few high-dimensional vectors?
> >>
> >> As I said before, I am a big +1 for NOT just raising it blindly, but I
> >> believe we need to remove the limit or size it in a way that's not a problem
> >> for either users or internal data structure optimizations, if any.
> >>
> >>
> >> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com> wrote:
> >>> I'd ask anyone voting +1 to raise this limit to at least try to index
> >>> a few million vectors with 756 or 1024 dimensions, which is allowed today.
> >>>
> >>> IMO, based on how painful it is, it seems the limit is already too
> >>> high. I realize that will sound controversial, but please at least try
> >>> it out!
> >>>
> >>> Voting +1 without at least doing this is really the
> >>> "weak/unscientifically minded" approach.
> >>>
> >>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
> >>> <michael.wech...@wyona.com> wrote:
> >>>> Thanks for your feedback!
> >>>>
> >>>> I agree that it should not crash.
> >>>>
> >>>> So far we have not experienced crashes ourselves, but we have not indexed
> >>>> millions of vectors.
> >>>>
> >>>> I will try to reproduce the crash; maybe this will help us to move
> >>>> forward.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Michael
> >>>>
> >>>> On 05.04.23 at 18:30, Dawid Weiss wrote:
> >>>>>> Can you describe your crash in more detail?
> >>>>> I can't. That experiment was a while ago and a quick test to see if I
> >>>>> could index rather large-ish USPTO (patent office) data as vectors.
> >>>>> Couldn't do it then.
> >>>>>
> >>>>>> How much RAM?
> >>>>> My indexing jobs run with rather smallish heaps to give space for I/O
> >>>>> buffers. Think 4-8GB at most. So yes, it could have been the problem.
> >>>>> I recall segment merging grew slower and slower and then simply
> >>>>> crashed. Lucene should work with low heap requirements, even if it
> >>>>> slows down. Throwing RAM at the indexing/segment-merging problem
> >>>>> is... I don't know - not elegant?
> >>>>>
> >>>>> Anyway, my main point was to remind folks about how Apache works:
> >>>>> code is merged in when there are no vetoes. If Rob (or anybody else)
> >>>>> remains unconvinced, he or she can block the change. (I didn't invent
> >>>>> those rules.)
> >>>>>
> >>>>> D.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
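For context on the heap sizes mentioned above, the raw float32 storage alone is easy to estimate: count × dimensions × 4 bytes, before any HNSW graph or merge overhead. A quick back-of-the-envelope (the numbers below are plain arithmetic, not measurements from this thread):

```python
def raw_vector_bytes(count, dim, bytes_per_component=4):
    """Raw float32 vector storage only -- excludes HNSW graph links,
    doc-id mappings, and any transient merge buffers."""
    return count * dim * bytes_per_component

# 8M vectors at 100 dims:  8_000_000 * 100  * 4 = 3.2 GB of raw floats
# 8M vectors at 1024 dims: 8_000_000 * 1024 * 4 = ~32.8 GB,
# far beyond the 4-8GB heaps mentioned above (though Lucene streams
# vector data off-heap during merges, so heap pressure is indirect).
gb_1024 = raw_vector_bytes(8_000_000, 1024) / 1e9
```

This rough math is one way to see why tests at 100d complete comfortably while 1024d runs into trouble: the data volume per vector grows tenfold, and graph construction cost grows with it.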