Thank you Anton for bringing closure! And Marc for catching this! Phew :)
Mike McCandless
http://blog.mikemccandless.com
On Thu, Apr 6, 2023 at 5:51 AM Anton Hägerstrand wrote:
> Search profiling seems to work again - see e.g.
>
> We shouldn't accept weakly or unscientifically motivated vetoes anyway, right?
In fact we must accept any committer's veto on a change to Lucene's source
code, regardless of that committer's reasoning. This is the power of Apache's
model.
Of course we all can and will work
Search profiling seems to work again - see e.g.
https://blunders.io/jfr-demo/searching-2023.04.04.18.03.12/jvm_info
Thanks Marc for reporting and Mike for fixing!
/Anton
On Tue, 4 Apr 2023, 17:53 Michael McCandless wrote:
> Actually, I spoke too soon. The NIGHTLY_LOG_DIR is indeed a bit
"10 MB hard drive, wow, I'll never need another floppy disk ever..."
"Neural nets... nice idea, but there will never be enough CPU power to run
them..."
etc.
Is it possible to make it a configurable limit?
I think Gus is spot on; I agree 100%.
Vector dimension is already configurable; it's the max
To be clear, Robert, I agree with you about not bumping it to 2048 or some
other insufficiently motivated constant.
But I disagree on the performance perspective:
I am absolutely in favor of working to improve the current performance, but I
think that is disconnected from this limit.
Not all
yes, it makes a difference. It will take less time and CPU to do it
all in one go, producing a single segment (assuming the data does not
exceed the IndexWriter RAM buffer size). If you index a lot of little
segments and then force merge them, it will take longer, because it has
to build the graphs
Thanks!
I will try to run some tests to be on the safe side :-)
On 06.04.23 at 16:28, Michael Sokolov wrote:
yes, it makes a difference. It will take less time and CPU to do it
all in one go, producing a single segment (assuming the data does not
exceed the IndexWriter RAM buffer size). If
> I don't know, Alessandro. I just wanted to point out the fact that by
Apache rules a committer's veto on a code change counts as a no-go.
Yeah Dawid, I was not being provocative; I was genuinely asking what a
pragmatic approach would be to choosing a limit or removing it, because I
don't know how to
I think we should focus on testing where the limits are and what causes them.
Let's get out of this fog :-)
Thanks
Michael
On 06.04.23 at 11:47, Michael McCandless wrote:
> We shouldn't accept weakly or unscientifically motivated vetoes anyway, right?
In fact we must accept all
re: how does this HNSW stuff scale - I think people are calling out
indexing memory usage here, so let's discuss some facts. During
initial indexing we hold in RAM all the vector data and the graph
constructed from the new documents, but this is accounted for and
limited by the size of
As I said earlier, a max limit limits usability.
It's not that it forces users with small vectors to pay the performance
penalty of big vectors; it's that it literally prevents some users from using
Lucene/Solr/Elasticsearch at all.
As far as I know, the max limit is only used to raise an exception; it's not
used to
thanks very much for these insights!
Does it make a difference, RAM-wise, when I do a batch import, for example
importing 1000 documents, closing the IndexWriter, and doing a forceMerge,
versus importing 1 million documents at once?
I would expect so, or do I misunderstand this?
Thanks
Michael
On 06.04.23 at
If we find issues with larger limits, maybe have a configurable limit like we
do for maxBooleanClauses. Maybe somebody wants to run with a 100G heap and do
one query per second.
Where I work (LexisNexis), we have high-value queries, but just not that many
of them per second.
wunder
Walter
I am not sure I get the point of making the limit configurable:
1) if it is configurable but defaults to a max of 1024, it means that we
don't enforce any limit aside from the max integer behind the scenes.
So if you want to set a vector dimension for a field to 5000, you need to
first set a MAX compatible
Well, I'm asking that people actually try to test using such high dimensions.
Based on my own experience, I consider it unusable. It seems other
folks may have run into trouble too. If the project committers can't
even really use vectors with such high dimension counts, then it's not
in an OK state for
I'm trying to run a test. I indexed 8M 100d float32 vectors in ~20
minutes with a single thread. I have some 256K vectors, but only about
2M of them. Can anybody point me to a large set (say 8M+) of 1024+ dim
vectors I can use for testing? If all else fails I can test with
noise, but that tends to
On 06.04.23 at 17:47, Robert Muir wrote:
Well, I'm asking that people actually try to test using such high dimensions.
Based on my own experience, I consider it unusable. It seems other
folks may have run into trouble too. If the project committers can't
even really use vectors with such high
> We all need to be on the same page, grounded in reality, not fantasy:
if we set a limit of 1024 or 2048, you can actually index vectors with
that many dimensions and it actually works and scales
This is something that I agree with. When we test it, I think we should go
in with the
Great, thank you!
How much RAM etc. did you run this test with?
Do the vectors really have to be based on real data for testing the
indexing?
I understand that it matters if you want to test the quality of the search
results, but for testing the scalability itself it should not matter.