Michael Sokolov created LUCENE-10590:
----------------------------------------
Summary: Indexing all zero vectors leads to heat death of the universe
Key: LUCENE-10590
URL: https://issues.apache.org/jira/browse/LUCENE-10590
Project: Lucene - Core
Issue Type: Bug
Reporter: Michael Sokolov
By accident while testing something else, I ran a luceneutil test indexing 1M
100d vectors that were all zeroes. Indexing took a very long time (~40x normal,
though it did eventually complete), and search performance was similarly bad.
We should not degrade by orders of magnitude, even with the worst-case data.
I'm not entirely sure what the issue is, but perhaps we keep exploring the
graph for as long as we keep finding hits that are "better", where better means
(score, -docid) >= (lowest score, -docid). If that's right and all docs have
the same score, then we probably need to either switch to > (though this could
lead to poorer recall in normal cases) or introduce some kind of minimum score
threshold?
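
To make the hypothesis concrete, here is a toy Python model of a best-first
graph search. It is only a sketch under stated assumptions, not Lucene's actual
HNSW code: the function count_visited, the ring-shaped graph, and the values of
n and k are all made up for illustration, and it compares scores only, ignoring
the docid tie-break. It counts how many nodes get scored when every node has
the same score, once with an inclusive (>=) acceptance test and once with a
strict (>) one.

```python
import heapq

def count_visited(neighbors, scores, entry, k, inclusive):
    """Toy best-first graph search; returns how many nodes were scored.

    Hypothetical helper for illustration only, not a Lucene API.
    """
    visited = {entry}
    candidates = [(-scores[entry], entry)]  # max-heap via negated score
    results = [(scores[entry], entry)]      # min-heap, worst kept hit on top
    while candidates:
        neg_score, node = heapq.heappop(candidates)
        # Stop once the best remaining candidate is strictly worse
        # than the worst hit in a full result queue.
        if len(results) >= k and -neg_score < results[0][0]:
            break
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            s = scores[nb]
            worst = results[0][0]
            # Assumption: scores alone decide acceptance (no docid tie-break).
            accept = len(results) < k or (s >= worst if inclusive else s > worst)
            if accept:
                heapq.heappush(candidates, (-s, nb))
                heapq.heappush(results, (s, nb))
                if len(results) > k:
                    heapq.heappop(results)
    return len(visited)

# Made-up graph: a ring where each node links to its 2 nearest nodes per side.
n, k = 1000, 10
ring = [[(i + d) % n for d in (-2, -1, 1, 2)] for i in range(n)]
all_zero = [0.0] * n  # every node scores the same, as with all-zero vectors

inclusive_visits = count_visited(ring, all_zero, entry=0, k=k, inclusive=True)
strict_visits = count_visited(ring, all_zero, entry=0, k=k, inclusive=False)
print(inclusive_visits, strict_visits)
```

In this toy model the inclusive test accepts every tied neighbor, so the search
walks the whole graph, while the strict test stops accepting once the result
queue is full and touches only a small neighborhood of the entry point. Whether
the real searcher behaves this way is exactly the open question above.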