benwtrent opened a new pull request, #12413:
URL: https://github.com/apache/lucene/pull/12413
We have some weird behavior in the HNSW searcher when finding the candidate
entry point for the zeroth layer.
While trying to find the best entry point to gather the full candidate sets,
we don't filter based on the acceptableOrds bitset. Consequently, if we exit
the search early (before hitting the zeroth layer), the results that are
returned may contain documents NOT within that bitset.
Luckily, since the results are marked as incomplete, the `*VectorQuery` logic
switches back to an exact scan and throws away the results.
However, if any user called the leaf searcher directly, bypassing the query,
they could run into this bug.
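To make the failure mode concrete, here is a toy Python model of a greedy walk on an upper layer. It is only a sketch and not the Lucene code: `greedy_upper_layer_search`, `visit_limit`, and the 1-D `VECTORS`/`NEIGHBORS` data are all hypothetical names invented for illustration, and `accept_ords` stands in for the acceptableOrds bitset. When the visit limit is hit before layer zero, the current best node is returned as the (incomplete) result, and unless it is filtered it may be outside the accepted set.

```python
from typing import Dict, List, Optional, Set, Tuple


def greedy_upper_layer_search(
    query: float,
    entry: int,
    vectors: Dict[int, float],
    neighbors: Dict[int, List[int]],
    visit_limit: int,
    accept_ords: Optional[Set[int]] = None,
) -> Tuple[List[int], bool]:
    """Toy greedy walk toward `query` on a single upper layer.

    Returns (nodes, incomplete). On a normal finish, `nodes` holds the
    entry point for the next layer. On an early exit, `nodes` is what a
    direct caller would receive as results.
    """
    best = entry
    visited = 1  # the entry node counts as visited
    improved = True
    while improved:
        improved = False
        for candidate in neighbors.get(best, []):
            if visited >= visit_limit:
                # Early exit before reaching layer zero.
                results = [best]
                if accept_ords is not None:
                    # The filtering step: drop anything outside the accepted set.
                    results = [n for n in results if n in accept_ords]
                return results, True
            visited += 1
            if abs(vectors[candidate] - query) < abs(vectors[best] - query):
                best = candidate
                improved = True
    return [best], False


# Hypothetical 3-node layer on a line; only node 2 is an accepted document.
VECTORS = {0: 0.0, 1: 5.0, 2: 9.0}
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1]}

unfiltered = greedy_upper_layer_search(10.0, 0, VECTORS, NEIGHBORS, visit_limit=2)
filtered = greedy_upper_layer_search(
    10.0, 0, VECTORS, NEIGHBORS, visit_limit=2, accept_ords={2}
)
print(unfiltered)  # ([1], True): node 1 is not an accepted document
print(filtered)    # ([], True): incomplete, but nothing outside the set leaks
```

In this toy, a caller that ignores the incomplete flag and skips the exact-scan fallback would see node 1 in the unfiltered case, which mirrors the situation of calling the leaf searcher directly.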
I ran performance tests and there were no significant latency increases.
There do seem to be observable latency decreases though at higher `maxConn`
levels.
I am getting slightly different recall. Usually better by `0.001`, but worse
by `0.001` on glove 100 with
```
nDoc fanout maxConn beamWidth
100000 20 96 500 120
```
so I am digging into why that may be. Any help there is appreciated.
Data (lucene util knnPerf):
```
dim = 100
doc_vectors = constants.GLOVE_VECTOR_DOCS_FILE
query_vectors = '%s/util/tasks/vector-task-100d.vec' % constants.BASE_DIR
```
Settings run:
```
VALUES = {
'ndoc': (100000,),
'maxConn': (32, 96),
'beamWidthIndex': (250, 500,),
'fanout': (20, 100,),
}
```
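For readers unfamiliar with this kind of settings dictionary, the sketch below shows how many runs it implies, assuming the harness takes the Cartesian product of the listed values (that is an assumption here, not something stated above). `expand_grid` is a hypothetical helper, not part of lucene util.

```python
import itertools

VALUES = {
    'ndoc': (100000,),
    'maxConn': (32, 96),
    'beamWidthIndex': (250, 500),
    'fanout': (20, 100),
}


def expand_grid(values):
    """Return one dict per combination of the parameter values."""
    keys = list(values)
    combos = itertools.product(*(values[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


runs = expand_grid(VALUES)
print(len(runs))  # 1 * 2 * 2 * 2 = 8 combinations
for run in runs:
    print(run)
```

Under that assumption, the grid yields eight runs, one of which is the `maxConn=96`, `beamWidthIndex=500`, `fanout=20` combination.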
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]