benwtrent opened a new issue, #15132:
URL: https://github.com/apache/lucene/issues/15132

   ### Description
   
   While we have been doing a ton of work on making sparse filtering better, 
Lucene does poorly with very dense filters. 
   
   Part of the problem is that we haven't measured it well in Lucene 
benchmarks. I noticed that luceneutil's filtering tests don't give a 
realistic picture of how a filter behaves (I am fixing this now, PR 
soon).
   
   
   The cause: when users pre-filter and the filter is very dense (e.g. 90+% 
of docs pass the filter), Lucene eagerly evaluates the whole filter, 
iterating many docs. That costs far more than simply gathering some nearest 
neighbor vectors, and QPS drops sharply.
   
   
   It is likely much better to simply oversample a bit on the graph search and 
then apply the filter as a post filter. 
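
To make the idea concrete, here is a minimal sketch of "oversample, then post-filter". It is plain Java with no Lucene dependency; `PostFilterSketch`, `oversampledK`, `postFilter`, and the `safetyFactor` parameter are hypothetical names for illustration, not existing Lucene APIs. It assumes the unfiltered graph search returns doc ids ordered best-first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch of oversampling plus post-filtering; not a Lucene API. */
final class PostFilterSketch {

  /**
   * Number of unfiltered neighbors to request so that roughly k survive a filter
   * that passes the given fraction of docs. safetyFactor is an assumed tuning knob.
   */
  static int oversampledK(int k, double filterSelectivity, double safetyFactor) {
    if (k <= 0 || filterSelectivity <= 0.0 || filterSelectivity > 1.0) {
      throw new IllegalArgumentException("k must be positive and selectivity in (0, 1]");
    }
    return (int) Math.ceil(k / filterSelectivity * safetyFactor);
  }

  /** Keeps the first k candidates (ordered best-first) that pass the filter. */
  static List<Integer> postFilter(int[] candidatesBestFirst, IntPredicate filter, int k) {
    List<Integer> hits = new ArrayList<>();
    for (int doc : candidatesBestFirst) {
      if (hits.size() == k) {
        break; // enough filtered hits collected
      }
      if (filter.test(doc)) {
        hits.add(doc);
      }
    }
    return hits;
  }
}
```

With a 90% dense filter and `k = 10`, the sketch asks for only a handful of extra neighbors, and the filter is evaluated just for those candidates instead of for every matching doc. If fewer than `k` candidates survive, a real implementation would need a fallback (for example, searching again with a larger `k`), which this sketch omits.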
   
   This threshold will be tricky to figure out. Though I think we can use the 
"expected vector ops" compared with the filter threshold to make a "guess" on 
how expensive it would be to "gather more vectors" vs. just applying the filter.
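
One possible shape for that "guess", as a rough sketch only: compare the extra vector ops that oversampling would add against the number of docs the filter would iterate. Everything here is an assumption for illustration: the class and method names, the simple cost model, and `vectorOpWeight` (the assumed relative cost of one vector comparison versus one filter doc iteration). Filter selectivity is approximated as `filterCost / maxDoc`.

```java
/** Hypothetical cost heuristic for dense filters; not a Lucene API. */
final class DenseFilterHeuristic {

  /**
   * Returns true when oversampling plus a post-filter is estimated to be cheaper
   * than eagerly iterating the filter.
   *
   * @param filterCost estimated number of docs the filter would iterate
   * @param maxDoc total number of docs in the segment
   * @param expectedVectorOps estimated vector comparisons for an unfiltered search
   * @param vectorOpWeight assumed cost of one vector op relative to one filter doc
   */
  static boolean preferPostFilter(
      long filterCost, long maxDoc, long expectedVectorOps, double vectorOpWeight) {
    if (maxDoc <= 0 || filterCost < 0 || expectedVectorOps < 0 || vectorOpWeight <= 0) {
      throw new IllegalArgumentException("invalid cost inputs");
    }
    if (filterCost == 0) {
      return false; // nothing matches; post-filtering cannot help
    }
    double selectivity = Math.min(1.0, (double) filterCost / maxDoc);
    // Oversampling by 1/selectivity adds this many vector ops on top of the baseline.
    double extraVectorOps = expectedVectorOps * (1.0 / selectivity - 1.0);
    return extraVectorOps * vectorOpWeight < filterCost;
  }
}
```

Under this model a filter matching 90% of 1M docs favors the post-filter, while one matching 0.1% favors the eager filter. The real threshold would have to come from benchmarks.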
   
   Some initial experiments I have done show that eagerly evaluating the 
cheapest of filters (one where it's literally constant time to iterate each doc 
and put it in a bit set) can cause a 5x+ slowdown vs. just a post filter. 
   
   
   I looked at the new AcceptedDocs API, and I am not sure it will actually 
work well for this as it would require the format to first collect docs, then 
filter, then pass them to the KnnCollector again.
   
   Maybe that's ok? My initial thought is that this belongs in the query.



