[ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478275#comment-17478275 ]

ASF subversion and git services commented on LUCENE-10040:
----------------------------------------------------------

Commit 00a7d5f1708a6ce43960d1b8b66f2aebd9568fc6 in lucene's branch 
refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=00a7d5f ]

LUCENE-10040: Update HnswGraph javadoc related to deletions

Previously it claimed the search method did not handle deletions.


> Handle deletions in nearest vector search
> -----------------------------------------
>
>                 Key: LUCENE-10040
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10040
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Assignee: Julie Tibshirani
>            Priority: Major
>             Fix For: 9.0
>
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even 
> if a document is not live according to {{LeafReader#getLiveDocs}}, it could 
> still be returned from {{LeafReader#searchNearestVectors}}. This seems like 
> it'd be surprising and difficult for users, since other search APIs account 
> for deleted docs. We've discussed extending the search logic to take a 
> parameter like {{Bits liveDocs}}; this issue tracks the options for adding 
> that support.
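> A minimal sketch of what such an extended call could look like, assuming the 
> proposed {{Bits liveDocs}} parameter (the helper name and exact parameter 
> shape here are illustrative of the proposal, not a final API):
> {code:java}
> import java.io.IOException;
> import org.apache.lucene.index.LeafReader;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.util.Bits;
> 
> class LiveDocsSearchSketch {
>   // Hypothetical helper: pass the segment's live-docs bitset into the vector
>   // search so deleted documents are excluded from the top-k result.
>   static TopDocs searchLiveVectors(LeafReader reader, String field,
>                                    float[] query, int k) throws IOException {
>     Bits liveDocs = reader.getLiveDocs(); // null when the segment has no deletions
>     // proposed shape discussed above; exact parameters may differ in the final API
>     return reader.searchNearestVectors(field, query, k, liveDocs);
>   }
> }
> {code}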
> One approach is to just filter out deleted docs after running the KNN search. 
> This behavior seems hard to work with as a user: fewer than {{k}} docs might 
> come back from your KNN search!
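> A small sketch of that post-filtering approach, assuming a {{TopDocs}} result 
> from an unfiltered KNN search (the helper name is illustrative); it shows how 
> the surviving hits can end up short of {{k}}:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.util.Bits;
> 
> class PostFilterSketch {
>   // Drop deleted docs from an already-computed KNN result. The filtered list
>   // can contain fewer than k hits, which is the usability problem noted above.
>   static List<ScoreDoc> filterDeleted(TopDocs knnResults, Bits liveDocs) {
>     List<ScoreDoc> live = new ArrayList<>();
>     for (ScoreDoc sd : knnResults.scoreDocs) {
>       // liveDocs == null means the segment has no deletions
>       if (liveDocs == null || liveDocs.get(sd.doc)) {
>         live.add(sd);
>       }
>     }
>     return live; // may hold fewer than k entries
>   }
> }
> {code}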
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the 
> {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs 
> while assembling its candidate list. It would traverse further into the 
> graph, visiting more nodes to ensure it gathers the required candidates. 
> (Note that deleted docs would still be visited and traversed). The [hnswlib 
> library|https://github.com/nmslib/hnswlib] contains an implementation like 
> this, where you can mark documents as deleted and they're skipped during 
> search.
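> A heavily simplified sketch of that idea (breadth-first rather than HNSW's 
> distance-ordered search, and {{NeighborSource}} is a hypothetical stand-in for 
> graph access, not Lucene's {{HnswGraph}} API): deleted nodes are still 
> expanded so the graph stays connected, but only live nodes are collected.
> {code:java}
> import java.util.ArrayDeque;
> import java.util.ArrayList;
> import java.util.Deque;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Set;
> import org.apache.lucene.util.Bits;
> 
> class SkipDeletedSketch {
>   // Hypothetical accessor for a node's neighbors; not Lucene's HnswGraph API.
>   interface NeighborSource {
>     int[] neighborsOf(int node);
>   }
> 
>   // Deleted nodes are visited and expanded, but never returned; the walk
>   // continues until k live candidates are found or the frontier is empty.
>   static List<Integer> collectLiveCandidates(
>       NeighborSource graph, int entryPoint, int k, Bits liveDocs) {
>     List<Integer> results = new ArrayList<>();
>     Set<Integer> visited = new HashSet<>();
>     Deque<Integer> frontier = new ArrayDeque<>();
>     frontier.add(entryPoint);
>     visited.add(entryPoint);
>     while (!frontier.isEmpty() && results.size() < k) {
>       int node = frontier.poll();
>       if (liveDocs == null || liveDocs.get(node)) {
>         results.add(node); // only undeleted docs reach the candidate list
>       }
>       for (int neighbor : graph.neighborsOf(node)) {
>         if (visited.add(neighbor)) {
>           frontier.add(neighbor);
>         }
>       }
>     }
>     return results;
>   }
> }
> {code}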
> This approach seems reasonable to me, but there are some challenges:
>  * Performance can be unpredictable. If deletions are random, it shouldn't 
> have a huge effect. But in the worst case, a segment could have 50% deleted 
> docs, and they all happen to be near the query vector. HNSW would need to 
> traverse through around half the entire graph to collect neighbors.
>  * As far as I know, there hasn't been academic research or any testing into 
> how well this performs in terms of recall. I have a vague intuition it could 
> be harder to achieve high recall as the algorithm traverses areas further 
> from the "natural" entry points. The HNSW paper doesn't mention deletions or 
> filtering, and I haven't seen community benchmarks around it.
> Background links:
>  * Thoughts on deletions from the author of the HNSW paper: 
> [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
>  * Blog from Vespa team which mentions combining KNN and search filters (very 
> similar to applying deleted docs): 
> [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. 
> The "Exact vs Approximate" section shows good performance even when a large 
> percentage of documents are filtered out. The team mentioned to me they 
> didn't have the chance to measure recall, only latency.



