Re: [Proposal] Remove max number of dimensions for KNN vectors

Michael Wechner Thu, 06 Apr 2023 07:33:56 -0700

Thanks!

I will try to run some tests to be on the safe side :-)


On 06.04.23 at 16:28, Michael Sokolov wrote:

Yes, it makes a difference. It will take less time and CPU to do it
all in one go, producing a single segment (assuming the data does not
exceed the IndexWriter RAM buffer size). If you index a lot of little
segments and then force merge them, it will take longer, because it has
to build the graphs for the little segments, and then for the big one
when merging. It will eventually use the same amount of RAM to
build the big graph, although I don't believe it will have to load the
vectors en masse into RAM while merging.
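
To make the comparison above concrete, here is a rough sketch in Python that
only counts how many times a document gets inserted into an HNSW graph under
each strategy. It is a toy model, not Lucene code: the function names are
hypothetical, it assumes the merged graph is rebuilt from all documents (as
described above), and it ignores intermediate merges and the fact that an
insertion gets more expensive as the graph grows.

```python
def insertions_single_pass(num_docs: int) -> int:
    """One flush, one segment: every document is inserted into a graph once."""
    return num_docs


def insertions_batched_then_merged(num_docs: int, batch_size: int) -> int:
    """Index in batches (one small segment each), then force merge to one segment.

    Assumption: the force merge builds the big graph from scratch, so every
    document is inserted a second time. If everything fit in one batch there
    is only one segment and nothing to merge.
    """
    per_segment_insertions = num_docs  # sum over all the small segments
    merge_insertions = num_docs if num_docs > batch_size else 0
    return per_segment_insertions + merge_insertions


if __name__ == "__main__":
    n = 1_000_000
    print("single pass:      ", insertions_single_pass(n))
    print("batches of 1000:  ", insertions_batched_then_merged(n, 1000))
```

Under these assumptions the batched import does about twice the graph-building
work, while the final merge still has to hold the full graph in RAM.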

On Thu, Apr 6, 2023 at 10:20 AM Michael Wechner wrote:

Thanks very much for these insights!

Does it make a difference regarding RAM whether I do a batch import (for
example, import 1000 documents, close the IndexWriter and do a forceMerge)
or import 1 million documents at once?

I would expect so, or do I misunderstand this?

Thanks

Michael



On 06.04.23 at 16:11, Michael Sokolov wrote:

re: how does this HNSW stuff scale - I think people are calling out
indexing memory usage here, so let's discuss some facts. During
initial indexing we hold in RAM all the vector data and the graph
constructed from the new documents, but this is accounted for and
limited by the size of IndexWriter's buffer; the document vectors and
their graph will be flushed to disk when this fills up, and at search
time, they are not read in wholesale to RAM. There is potentially
unbounded RAM usage during merging though, because the entire merged
graph will be built in RAM. I lost track of how we handle the vector
data now, but at least in theory it should be fairly straightforward
to write the merged vector data in chunks using only limited RAM. So
how much RAM does the graph use? It uses numdocs*fanout VInts.
Actually it doesn't really scale with the vector dimension at all -
rather it scales with the graph fanout (M) parameter and with the
total number of documents. So I think this focus on limiting the
vector dimension is not helping to address the concern about RAM usage
while merging.
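
As a back-of-the-envelope illustration of the numdocs*fanout estimate above,
the following Python sketch compares the graph size with the raw vector data
for a few dimensions. The helper names are hypothetical, and the sizes are
assumptions: 4 bytes per neighbor entry (a VInt takes 1 to 5 bytes, so this is
only a round figure), 4 bytes per float32 vector component, and a fanout of 16
chosen purely as an example value.

```python
def graph_bytes(num_docs: int, fanout: int, bytes_per_entry: int = 4) -> int:
    """Approximate graph size: numdocs * fanout neighbor entries.

    Note that the vector dimension does not appear anywhere in this formula.
    bytes_per_entry is an assumed average size for one neighbor id.
    """
    return num_docs * fanout * bytes_per_entry


def vector_bytes(num_docs: int, dim: int, bytes_per_component: int = 4) -> int:
    """Raw float32 vector data: this is the part that grows with the dimension."""
    return num_docs * dim * bytes_per_component


if __name__ == "__main__":
    num_docs, fanout = 1_000_000, 16  # example values, not measurements
    for dim in (768, 1024, 2048):
        g = graph_bytes(num_docs, fanout)
        v = vector_bytes(num_docs, dim)
        print(f"dim={dim:5d}  graph={g / 1e6:8.1f} MB  vectors={v / 1e6:8.1f} MB")
```

With these assumed sizes the graph estimate stays the same for every
dimension, and only the vector data grows, which is the point made above:
capping the dimension does not cap the RAM needed for the merged graph.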

The vector dimension does play a strong role in search and indexing
time, but the impact is linear in the dimension and won't
exhaust any limited resource.

On Thu, Apr 6, 2023 at 5:48 AM Michael McCandless wrote:

We shouldn't accept weakly/not scientifically motivated vetos anyway right?

In fact we must accept all vetos by any committer as a veto, for a change to 
Lucene's source code, regardless of that committer's reasoning.  This is the 
power of Apache's model.

Of course we all can and will work together to convince one another (this is 
where the scientifically motivated part comes in) to change our votes, one way 
or another.

I'd ask anyone voting +1 to raise this limit to at least try to index a few 
million vectors with 756 or 1024, which is allowed today.

+1, if the current implementation really does not scale / needs more and more 
RAM for merging, let's understand what's going on here, first, before 
increasing limits.  I rescind my hasty +1 for now!

Mike McCandless

http://blog.mikemccandless.com


On Wed, Apr 5, 2023 at 11:22 AM Alessandro Benedetti wrote:

Ok, so what should we do then?
This space is moving fast, and in my opinion we should act fast to release and 
guarantee we attract as many users as possible.

At the same time, I am not saying we should proceed blind: if there's concrete 
evidence for setting one limit rather than another, or that a certain limit is 
detrimental to the project, I think that veto should be valid.

We shouldn't accept weakly/not scientifically motivated vetos anyway right?

The problem I see is that, before voting, we should first decide on this limit, 
and I don't know how we can do that.
I am imagining something like a poll where each entry is a limit plus its 
motivation, and PMC members vote on or add entries?

Did anything similar happen in the past? How was the current limit added?


On Wed, 5 Apr 2023, 14:50 Dawid Weiss wrote:

Should we create a VOTE thread, where we propose some values with a justification 
and we vote?

Technically, a vote thread won't help much if there's no full consensus - a 
single veto will make the patch unacceptable for merging.
https://www.apache.org/foundation/voting.html#Veto

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

