benwtrent commented on PR #14173:
URL: https://github.com/apache/lucene/pull/14173#issuecomment-2634182175
> Java limits the size of arrays (and lists) to 'int max' and does not allow
> 'long' array indices. These will need to be changed to use a different data
> structure.
Yeah, I don't like that we have this right now for `int` values at all. I
don't immediately know how to solve this, but it's pretty dang expensive as it
is now.
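To make the quoted limitation concrete, here is a minimal sketch of one possible "different data structure": a long-indexed `int` store split into fixed-size pages. The class name `PagedIntArray` and the page size are hypothetical and chosen only for illustration. Nothing in this PR proposes this exact shape.

```java
// Hypothetical sketch: a long-indexed int store built from fixed-size pages.
// Each page is a plain int[], so no single array needs more than PAGE_SIZE slots.
final class PagedIntArray {
    private static final int PAGE_BITS = 20;               // assumed: 2^20 ints per page
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final int[][] pages;
    private final long size;

    PagedIntArray(long size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be >= 0");
        }
        this.size = size;
        // The number of pages must still fit in an int, which bounds the total size.
        int pageCount = Math.toIntExact((size + PAGE_SIZE - 1) >>> PAGE_BITS);
        pages = new int[pageCount][];
        for (int i = 0; i < pageCount; i++) {
            long remaining = size - ((long) i << PAGE_BITS);
            pages[i] = new int[(int) Math.min(PAGE_SIZE, remaining)];
        }
    }

    int get(long index) {
        checkIndex(index);
        return pages[(int) (index >>> PAGE_BITS)][(int) (index & PAGE_MASK)];
    }

    void set(long index, int value) {
        checkIndex(index);
        pages[(int) (index >>> PAGE_BITS)][(int) (index & PAGE_MASK)] = value;
    }

    long size() {
        return size;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + " out of range for size " + size);
        }
    }
}
```

The extra shift and mask on every access are part of why a long-indexed structure is not free compared with a plain `int[]`.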
> We use bitsets on vector ordinals for multiple use cases, and almost all
> bitsets work with integers.
Correct. Filtering, though, is always on `doc` ids rather than vector ordinals,
so we are OK there.
I understand the complexity of keeping track of visitation to prevent
re-visiting a node. However, I would argue that any graph-based or modern
vector index needs to keep track of the vectors it has visited, both to avoid
recalculating scores and to avoid exploring further down a previously
explored path.
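As a rough illustration of visited tracking, the sketch below walks a small adjacency-list graph and uses a `java.util.BitSet` so that each node is scored at most once. It is a plain breadth-first walk, not Lucene's HNSW search, and the class and method names are hypothetical. Note that `BitSet` itself is indexed by `int`, which is the constraint under discussion.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: breadth-first exploration with a visited bitset.
final class VisitedTrackingSketch {
    /** Returns node ordinals in the order they were first reached from {@code entry}. */
    static List<Integer> explore(int[][] neighbors, int entry) {
        BitSet visited = new BitSet(neighbors.length); // int-indexed: one bit per node ordinal
        ArrayDeque<Integer> frontier = new ArrayDeque<>();
        List<Integer> scored = new ArrayList<>();

        visited.set(entry);
        frontier.add(entry);
        while (!frontier.isEmpty()) {
            int node = frontier.poll();
            // A real search would compute the vector similarity here, once per node.
            scored.add(node);
            for (int next : neighbors[node]) {
                if (!visited.get(next)) { // skip nodes that were already reached
                    visited.set(next);
                    frontier.add(next);
                }
            }
        }
        return scored;
    }
}
```

Without the `visited` check, the cycles in a typical neighbor graph would cause the same vectors to be scored repeatedly.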
> For example, we could no longer assume that maxOrdinal is the graph size.
We can easily write out the graph size and indicate that the graph size is the
count of vectors (that is, the number of non-deleted vectors). True, it gets
complex with deleted docs. Likely we would need to iterate and count during
merges, though I would expect that to be a minor cost in addition to all the
other work we currently do during merging.
I don't think this is a particular problem.
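The "iterate and count during merges" step could look roughly like the sketch below. It assumes a hypothetical `ordToDoc` array mapping each vector ordinal to its doc id, and uses `java.util.BitSet` as a stand-in for a live-docs bitmap. This is not the actual Lucene merge code.

```java
import java.util.BitSet;

// Hypothetical sketch: derive the graph size by counting vectors whose docs are still live.
final class GraphSizeSketch {
    /**
     * @param ordToDoc ordToDoc[ord] is the doc id that owns vector ordinal ord (assumed layout)
     * @param liveDocs bit set for every doc id that has not been deleted
     * @return the number of vectors belonging to non-deleted docs
     */
    static long countLiveVectors(int[] ordToDoc, BitSet liveDocs) {
        long count = 0;
        for (int doc : ordToDoc) {
            if (liveDocs.get(doc)) {
                count++;
            }
        }
        return count;
    }
}
```

This is a single linear pass over the ordinals, which is small next to the graph construction work a merge already does.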
> Given the volume of changes with "long" graph nodeIds, I wonder if we
> should do it when we add a new ANN implementation (like DiskANN maybe?).
I don't understand how DiskANN would solve any of the previously expressed
problems.
- DiskANN still requires the graph to be available when doing insertion and
querying.
- We still need to keep track of filtering on docs (so bit set filtering on
doc ids, which is OK)
- During graph exploration, we would need to keep track of vectors already
visited.
- We would still need to resolve vector ordinal -> doc_id
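The last two points can be sketched together: candidate ordinals coming out of the graph are resolved to doc ids, and the filter is then applied to the doc ids. The `ordToDoc` array and the method name are hypothetical, `java.util.BitSet` stands in for the accept-docs filter, and ordinals are kept as `int` here for simplicity.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: resolve vector ordinal -> doc id, then filter on the doc id.
final class OrdToDocFilterSketch {
    static List<Integer> acceptedDocs(int[] candidateOrds, int[] ordToDoc, BitSet acceptDocs) {
        List<Integer> docs = new ArrayList<>();
        for (int ord : candidateOrds) {
            int doc = ordToDoc[ord];   // resolution step: needed regardless of the ANN algorithm
            if (acceptDocs.get(doc)) { // the filter bitset is indexed by doc id, not by ordinal
                docs.add(doc);
            }
        }
        return docs;
    }
}
```

Because the filter bitset is sized by the doc count rather than the vector count, it is unaffected by how large the ordinal space grows.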
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]