Re: Blunders profiling for Searching is broken

2023-04-06 Thread Michael McCandless
Thank you Anton for bringing closure! And Marc for catching this! Phew :) Mike McCandless http://blog.mikemccandless.com On Thu, Apr 6, 2023 at 5:51 AM Anton Hägerstrand wrote: > Search profiling seems to work again - see e.g. >

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael McCandless
> We shouldn't accept weakly/not scientifically motivated vetos anyway right? In fact we must accept all vetos by any committer as a veto, for a change to Lucene's source code, regardless of that committer's reasoning. This is the power of Apache's model. Of course we all can and will work

Re: Blunders profiling for Searching is broken

2023-04-06 Thread Anton Hägerstrand
Search profiling seems to work again - see e.g. https://blunders.io/jfr-demo/searching-2023.04.04.18.03.12/jvm_info Thanks Marc for reporting and Mike for fixing! /Anton On Tue, 4 Apr 2023, 17:53 Michael McCandless, wrote: > Actually, I spoke too soon. The NIGHTLY_LOG_DIR is indeed a bit

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Alessandro Benedetti
>10 MB hard drive, wow I'll never need another floppy disk ever... Neural nets... nice idea, but there will never be enough CPU power to run them... etc. Is it possible to make it a configurable limit? I think Gus is spot on, agree 100%. Vector dimension is already configurable, it's the max

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Alessandro Benedetti
To be clear, Robert, I agree with you about not just bumping it to 2048 or some other insufficiently motivated constant. But I disagree on the performance perspective: I am absolutely in favor of working to improve the current performance, but I think this is disconnected from that limit. Not all

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Sokolov
yes, it makes a difference. It will take less time and CPU to do it all in one go, producing a single segment (assuming the data does not exceed the IndexWriter RAM buffer size). If you index a lot of little segments and then force merge them it will take longer, because it had to build the graphs

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Wechner
Thanks! I will try to run some tests to be on the safe side :-) On 06.04.23 at 16:28, Michael Sokolov wrote: yes, it makes a difference. It will take less time and CPU to do it all in one go, producing a single segment (assuming the data does not exceed the IndexWriter RAM buffer size). If

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Alessandro Benedetti
> I don't know, Alessandro. I just wanted to point out the fact that by Apache rules a committer's veto to a code change counts as a no-go. Yeah Dawid, I was not being provocative; I was genuinely asking what a pragmatic approach to choosing a limit (or removing it) should be, because I don't know how to

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Wechner
I think we should focus on testing where the limits are and what might cause the limits. Let's get out of this fog :-) Thanks Michael On 06.04.23 at 11:47, Michael McCandless wrote: > We shouldn't accept weakly/not scientifically motivated vetos anyway right? In fact we must accept all

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Sokolov
re: how does this HNSW stuff scale - I think people are calling out indexing memory usage here, so let's discuss some facts. During initial indexing we hold in RAM all the vector data and the graph constructed from the new documents, but this is accounted for and limited by the size of

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Alessandro Benedetti
As I said earlier, a max limit limits usability. It's not forcing users with small vectors to pay the performance penalty of big vectors, it's literally preventing some users from using Lucene/Solr/Elasticsearch at all. As far as I know, the max limit is used to raise an exception, it's not used to

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Wechner
Thanks very much for these insights! Does it make a difference regarding RAM whether I do a batch import, for example import 1000 documents, close the IndexWriter and do a forceMerge, or import 1 million documents at once? I would expect so, or do I misunderstand this? Thanks Michael On 06.04.23 at

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Walter Underwood
If we find issues with larger limits, maybe have a configurable limit like we do for maxBooleanClauses. Maybe somebody wants to run with a 100G heap and do one query per second. Where I work (LexisNexis), we have high-value queries, but just not that many of them per second. wunder Walter
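A minimal sketch of what a configurable limit in the spirit of maxBooleanClauses could look like. This is not Lucene code: the names `VectorLimits`, `set_max_dimensions`, `get_max_dimensions` and `check_dimensions` are hypothetical, and the default of 1024 is taken from the limit discussed in this thread.

```python
# Hypothetical illustration only; none of these names exist in Lucene.
DEFAULT_MAX_DIMENSIONS = 1024  # the limit discussed in this thread


class VectorLimits:
    """Process-wide, adjustable ceiling on vector dimensions."""

    _max_dimensions = DEFAULT_MAX_DIMENSIONS

    @classmethod
    def set_max_dimensions(cls, value: int) -> None:
        # Reject nonsensical ceilings instead of silently accepting them.
        if value < 1:
            raise ValueError("max dimensions must be positive")
        cls._max_dimensions = value

    @classmethod
    def get_max_dimensions(cls) -> int:
        return cls._max_dimensions


def check_dimensions(dims: int) -> int:
    """Return dims unchanged, or raise if it exceeds the configured ceiling."""
    limit = VectorLimits.get_max_dimensions()
    if dims > limit:
        raise ValueError(f"vector has {dims} dimensions; the limit is {limit}")
    return dims
```

With the default in place, `check_dimensions(5000)` raises; after `VectorLimits.set_max_dimensions(8192)` the same call passes. Someone running a large heap at low query rates would raise the ceiling explicitly, while everyone else keeps the default.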

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Alessandro Benedetti
I am not sure I get the point to make the limit configurable: 1) if it is configurable, but default max to 1024, it means that we don't enforce any limit aside the max integer behind the scenes. So if you want to set a vector dimension for a field to 5000 you need to first set a MAX compatible

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Robert Muir
Well, I'm asking ppl actually try to test using such high dimensions. Based on my own experience, I consider it unusable. It seems other folks may have run into trouble too. If the project committers can't even really use vectors with such high dimension counts, then it's not in an OK state for

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Sokolov
I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20 minutes with a single thread. I have some 256K vectors, but only about 2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim vectors I can use for testing? If all else fails I can test with noise, but that tends to
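For a sense of scale, here is a back-of-the-envelope calculation of the raw float32 payload for the test sizes mentioned above. It is only a sketch: it counts the vector values at 4 bytes each and ignores the HNSW graph, index overhead and any compression, and the function name `raw_vector_bytes` is made up for illustration.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors: int, dims: int) -> int:
    """Raw size of the vector values alone (no graph, no index overhead)."""
    return num_vectors * dims * BYTES_PER_FLOAT32


# 8M vectors at 100 dimensions, as in the test described above.
small = raw_vector_bytes(8_000_000, 100)

# The same number of vectors at 1024 dimensions.
large = raw_vector_bytes(8_000_000, 1024)

print(f"100d:  {small / 1e9:.1f} GB")
print(f"1024d: {large / 1e9:.1f} GB")
```

The raw data grows linearly with the dimension count, so going from 100 to 1024 dimensions multiplies the payload by 10.24 for the same number of documents.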

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Wechner
On 06.04.23 at 17:47, Robert Muir wrote: Well, I'm asking ppl actually try to test using such high dimensions. Based on my own experience, I consider it unusable. It seems other folks may have run into trouble too. If the project committers can't even really use vectors with such high

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Marcus Eagan
> We all need to be on the same page, grounded in reality, not fantasy, where if we set a limit of 1024 or 2048, that you can actually index vectors with that many dimensions and it actually works and scales This is something that I agree with. When we test it, I think we should go in with the

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Michael Wechner
Great, thank you! How much RAM, etc. did you run this test on? Do the vectors really have to be based on real data for testing the indexing? I understand, if you want to test the quality of the search results it does matter, but for testing the scalability itself it should not matter
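If synthetic data turns out to be acceptable for a pure scalability test, a noise generator is easy to write. The following is a sketch using only the Python standard library; the function names, the fixed seed and the Gaussian distribution are assumptions for illustration and do not describe how anyone in this thread generated test data.

```python
import random
import struct
from typing import Iterator, List


def random_vectors(count: int, dims: int, seed: int = 42) -> Iterator[List[float]]:
    """Yield `count` vectors of `dims` Gaussian-noise values, reproducibly."""
    rng = random.Random(seed)  # fixed seed so runs are comparable
    for _ in range(count):
        yield [rng.gauss(0.0, 1.0) for _ in range(dims)]


def to_float32_bytes(vector: List[float]) -> bytes:
    """Pack a vector as little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(vector)}f", *vector)


# Example: three 1024-dimensional noise vectors.
vectors = list(random_vectors(3, 1024))
payload = to_float32_bytes(vectors[0])
```

Because the generator is lazy, a large number of vectors can be streamed into an indexer without holding them all in memory at once. Whether noise stresses graph construction the same way real embeddings do is exactly the open question raised above.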