benwtrent opened a new issue, #15132: URL: https://github.com/apache/lucene/issues/15132
### Description

While we have done a lot of work to make sparse filtering better, Lucene does poorly with very dense filters. Part of the problem is that we haven't measured it well in Lucene benchmarks. I noticed that luceneutil's filtering tests don't give a realistic picture of how a filter behaves (I am fixing this now, PR soon).

The cause: when users pre-filter and the filter is very dense (e.g. 90+% of docs pass the filter), Lucene throws QPS out the window by eagerly evaluating a very large filter that iterates many docs, which costs far more than simply gathering some nearest-neighbor vectors.

It is likely much better to oversample a bit on the graph search and then apply the filter as a post-filter.

The threshold will be tricky to figure out, though I think we can compare the "expected vector ops" with the filter threshold to "guess" how expensive it would be to "gather more vectors" vs. just applying the filter.

Some initial experiments I have done show that eagerly evaluating even the cheapest of filters (one where it's literally constant time to iterate each doc and put it in a bit set) can cause a 5x or greater slowdown vs. just a post-filter.

I looked at the new AcceptedDocs API, and I am not sure it will work well for this, as it would require the format to first collect docs, then filter, then pass them to the KnnCollector again. Maybe that's OK? My initial thought is that this belongs in the query.
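To make the "expected vector ops" vs. filter cost comparison concrete, here is a minimal sketch of what such a guess could look like. It is not Lucene code: the class, method names, cost constants, and the cost model itself (roughly `log2(maxDoc)` vector comparisons per gathered candidate, and a fixed oversampling factor for the post-filter path) are all assumptions made up for illustration.

```java
// Hypothetical sketch, not part of Lucene: a crude cost model for choosing
// between eagerly evaluating a filter (pre-filter) and oversampling the
// graph search, then filtering the hits afterwards (post-filter).
class DenseFilterHeuristic {
    enum Strategy { PRE_FILTER, POST_FILTER }

    // Assumption: gathering one candidate from the graph costs about
    // log2(maxDoc) vector comparisons.
    static double expectedVectorOps(int maxDoc, double candidates) {
        double opsPerCandidate = Math.log(Math.max(2, maxDoc)) / Math.log(2);
        return candidates * opsPerCandidate;
    }

    // vectorOpCost and filterDocCost are relative costs in arbitrary units;
    // oversample (> 1) is the extra margin gathered on the post-filter path.
    static Strategy choose(int maxDoc, long filterCardinality, int k,
                           double vectorOpCost, double filterDocCost,
                           double oversample) {
        if (filterCardinality <= 0) {
            return Strategy.PRE_FILTER; // nothing passes, no search needed
        }
        double selectivity = Math.min(1.0, filterCardinality / (double) maxDoc);
        // Candidates the search must touch so that about k of them pass.
        double needed = k / selectivity;

        // Pre-filter: iterate every matching doc up front, then search.
        double preCost = filterCardinality * filterDocCost
            + expectedVectorOps(maxDoc, needed) * vectorOpCost;
        // Post-filter: skip the eager iteration, but gather extra candidates.
        double postCost = expectedVectorOps(maxDoc, needed * oversample) * vectorOpCost;

        return postCost < preCost ? Strategy.POST_FILTER : Strategy.PRE_FILTER;
    }

    public static void main(String[] args) {
        // 95% of 1M docs pass the filter: the eager iteration dominates.
        System.out.println(choose(1_000_000, 950_000, 10, 50.0, 1.0, 1.5));
        // 0.1% of docs pass: oversampling would be far more expensive.
        System.out.println(choose(1_000_000, 1_000, 10, 50.0, 1.0, 1.5));
    }
}
```

Under these made-up constants the dense case picks the post-filter and the sparse case picks the pre-filter. The real threshold would have to come from benchmarks rather than from this model.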