[
https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alessandro Benedetti updated LUCENE-10040:
------------------------------------------
Labels: vector-based-search (was: )
> Handle deletions in nearest vector search
> -----------------------------------------
>
> Key: LUCENE-10040
> URL: https://issues.apache.org/jira/browse/LUCENE-10040
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Assignee: Julie Tibshirani
> Priority: Major
> Labels: vector-based-search
> Fix For: 9.0
>
> Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even
> if a document is not in {{LeafReader#getLiveDocs}}, it could still be
> returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be
> surprising + difficult for users, since other search APIs account for deleted
> docs. We've discussed extending the search logic to take a parameter like
> {{Bits liveDocs}}. This issue discusses options around adding support.
> One approach is to just filter out deleted docs after running the KNN search.
> This behavior seems hard to work with as a user: fewer than {{k}} docs might
> come back from your KNN search!
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the
> {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs
> while assembling its candidate list. It would traverse further into the
> graph, visiting more nodes to ensure it gathers the required candidates.
> (Note deleted docs would still be visited/ traversed). The [hnswlib
> library|https://github.com/nmslib/hnswlib] contains an implementation like
> this, where you can mark documents as deleted and they're skipped during
> search.
> This approach seems reasonable to me, but there are some challenges:
> * Performance can be unpredictable. If deletions are random, it shouldn't
> have a huge effect. But in the worst case, a segment could have 50% deleted
> docs, and they all happen to be near the query vector. HNSW would need to
> traverse through around half the entire graph to collect neighbors.
> * As far as I know, there hasn't been academic research or any testing into
> how well this performs in terms of recall. I have a vague intuition it could
> be harder to achieve high recall as the algorithm traverses areas further
> from the "natural" entry points. The HNSW paper doesn't mention deletions/
> filtering, and I haven't seen community benchmarks around it.
> Background links:
> * Thoughts on deletions from the author of the HNSW paper:
> [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
> * Blog from Vespa team which mentions combining KNN and search filters (very
> similar to applying deleted docs):
> [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/].
> The "Exact vs Approximate" section shows good performance even when a large
> percentage of documents are filtered out. The team mentioned to me they
> didn't have the chance to measure recall, only latency.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]