[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497373#comment-17497373 ] ASF subversion and git services commented on LUCENE-10382: -- Commit d952b3a58114ce5a929211bca7a9b0e822658f35 in lucene's branch refs/heads/branch_9x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d952b3a ] LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702) `KnnVectorQuery` currently uses the index reader's hashcode to make sure that the query it builds runs on the right reader. We had added `IndexReaderContext#id` a while back for a similar purpose with `TermStates`, let's reuse it? > Allow KnnVectorQuery to operate over a subset of liveDocs > - > > Key: LUCENE-10382 > URL: https://issues.apache.org/jira/browse/LUCENE-10382 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 9.0 >Reporter: Joel Bernstein >Priority: Major > Time Spent: 7h 50m > Remaining Estimate: 0h > > Currently the KnnVectorQuery selects the top K vectors from all live docs. > This ticket will change the interface to make it possible for the top K > vectors to be selected from a subset of the live docs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497369#comment-17497369 ] ASF subversion and git services commented on LUCENE-10382: Commit d47ff38d703c6b5da1ef9c774ccda201fd682b8d in lucene's branch refs/heads/main from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d47ff38 ] LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702) `KnnVectorQuery` currently uses the index reader's hashcode to make sure that the query it builds runs on the right reader. We had added `IndexReaderContext#id` a while back for a similar purpose with `TermStates`, let's reuse it?
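To illustrate why the commit above prefers an identity token over a hash code, here is a minimal sketch. `ReaderContext` and `PreparedQuery` are hypothetical stand-ins written for this note, not Lucene classes; the only assumption carried over from the commit message is that a context exposes an `id` object that can be compared by reference.

```java
// Hypothetical stand-ins for illustration only; these are not Lucene classes.
final class ReaderContext {
  // Opaque identity token: unique per context, holds no reference to the reader.
  private final Object id = new Object();
  private final int hash;

  ReaderContext(int hash) {
    this.hash = hash;
  }

  Object id() {
    return id;
  }

  @Override
  public int hashCode() {
    return hash; // two distinct contexts may legitimately share a hash code
  }
}

final class PreparedQuery {
  private final int readerHash;   // fragile check: hash codes can collide
  private final Object contextId; // robust check: compared by reference

  PreparedQuery(ReaderContext ctx) {
    this.readerHash = ctx.hashCode();
    this.contextId = ctx.id();
  }

  boolean matchesByHash(ReaderContext ctx) {
    return readerHash == ctx.hashCode();
  }

  boolean matchesById(ReaderContext ctx) {
    return contextId == ctx.id();
  }
}
```

With two contexts that happen to share a hash code, `matchesByHash` accepts the wrong one while `matchesById` does not.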
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497161#comment-17497161 ] ASF subversion and git services commented on LUCENE-10382: Commit a3b136573fcb2a1e61dd70519708a5ef36d20eb8 in lucene's branch refs/heads/branch_9x from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a3b1365 ] LUCENE-10382: Fix testSearchWithVisitedLimit failures
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497159#comment-17497159 ] ASF subversion and git services commented on LUCENE-10382: Commit d9c2e46824c8b5be8f471da6ce291e908cc58955 in lucene's branch refs/heads/main from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d9c2e46 ] LUCENE-10382: Fix testSearchWithVisitedLimit failures
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497066#comment-17497066 ] ASF subversion and git services commented on LUCENE-10382: Commit 29d4adfe60368c0159cd0accd53efba77ca11771 in lucene's branch refs/heads/branch_9x from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=29d4adf ] LUCENE-10382: Ensure kNN filtering works with other codecs (#700)
The original PR that added kNN filtering support overlooked non-default codecs. This follow-up ensures that other codecs work with the new filtering logic:
* Make sure to check the visited nodes limit in `SimpleTextKnnVectorsReader` and `Lucene90HnswVectorsReader`
* Add a test to `BaseKnnVectorsFormatTestCase` to cover this case
* Fix failures in `TestKnnVectorQuery#testRandomWithFilter`, whose assumptions don't hold when SimpleText is used
This PR also clarifies the limit checking logic for `Lucene91HnswVectorsReader`. Now we always check the limit before visiting a new node, whereas before we only checked it in an outer loop.
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497061#comment-17497061 ] ASF subversion and git services commented on LUCENE-10382: Commit b40a750aa8c0cc05291d8d8673d9d068d078d2de in lucene's branch refs/heads/main from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b40a750 ] LUCENE-10382: Ensure kNN filtering works with other codecs (#700)
The original PR that added kNN filtering support overlooked non-default codecs. This follow-up ensures that other codecs work with the new filtering logic:
* Make sure to check the visited nodes limit in `SimpleTextKnnVectorsReader` and `Lucene90HnswVectorsReader`
* Add a test to `BaseKnnVectorsFormatTestCase` to cover this case
* Fix failures in `TestKnnVectorQuery#testRandomWithFilter`, whose assumptions don't hold when SimpleText is used
This PR also clarifies the limit checking logic for `Lucene91HnswVectorsReader`. Now we always check the limit before visiting a new node, whereas before we only checked it in an outer loop.
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494283#comment-17494283 ] ASF subversion and git services commented on LUCENE-10382: Commit af40b448227e07e93d12c62f9dcf083b92f6eb51 in lucene's branch refs/heads/branch_9x from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=af40b44 ] LUCENE-10382: Support filtering in KnnVectorQuery (#656) This PR adds support for a query filter in KnnVectorQuery. First, we gather the query results for each leaf as a bit set. Then the HNSW search skips over the non-matching documents (using the same approach as for live docs). To prevent HNSW search from visiting too many documents when the filter is very selective, we short-circuit if HNSW has already visited more than the number of documents that match the filter, and execute an exact search instead. This bounds the number of visited documents at roughly 2x the cost of just running the exact filter, while in most cases HNSW completes successfully and does a lot better. Co-authored-by: Joel Bernstein
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494181#comment-17494181 ] ASF subversion and git services commented on LUCENE-10382: Commit 8ca372573dba0f4755b982b0c36a2b87aaf4705b in lucene's branch refs/heads/main from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8ca3725 ] LUCENE-10382: Support filtering in KnnVectorQuery (#656) This PR adds support for a query filter in KnnVectorQuery. First, we gather the query results for each leaf as a bit set. Then the HNSW search skips over the non-matching documents (using the same approach as for live docs). To prevent HNSW search from visiting too many documents when the filter is very selective, we short-circuit if HNSW has already visited more than the number of documents that match the filter, and execute an exact search instead. This bounds the number of visited documents at roughly 2x the cost of just running the exact filter, while in most cases HNSW completes successfully and does a lot better. Co-authored-by: Joel Bernstein
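The strategy in the commit message above can be sketched without Lucene. The code below is an illustration under stated assumptions, not the committed implementation: it uses a plain best-first search over a hand-built neighbor graph in place of HNSW, squared Euclidean distance as the metric, and hypothetical names (`FilteredKnnSketch`, `graphSearch`, `exactSearch`, `search`). What it does share with the description is the shape of the logic: non-matching nodes are traversed but never returned, the visit budget equals the number of filter matches and is checked before each new node, and an exhausted budget triggers an exact scan.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; a simple best-first graph search stands in for HNSW.
final class FilteredKnnSketch {
  static float sqDist(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Exact search: score every accepted document and keep the k closest. */
  static List<Integer> exactSearch(float[][] vectors, float[] query, int k, BitSet accepted) {
    Comparator<Integer> byDist = Comparator.comparingDouble(d -> sqDist(vectors[d], query));
    PriorityQueue<Integer> worstFirst = new PriorityQueue<>(byDist.reversed());
    for (int doc = accepted.nextSetBit(0); doc >= 0; doc = accepted.nextSetBit(doc + 1)) {
      worstFirst.add(doc);
      if (worstFirst.size() > k) worstFirst.poll(); // drop the current worst
    }
    List<Integer> top = new ArrayList<>(worstFirst);
    top.sort(byDist);
    return top;
  }

  /**
   * Graph search that traverses non-accepted nodes but never returns them.
   * Returns null if it would need to visit more than visitLimit nodes.
   */
  static List<Integer> graphSearch(float[][] vectors, int[][] neighbors, int entry,
      float[] query, int k, BitSet accepted, int visitLimit) {
    Comparator<Integer> byDist = Comparator.comparingDouble(d -> sqDist(vectors[d], query));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDist);
    PriorityQueue<Integer> results = new PriorityQueue<>(byDist.reversed());
    BitSet visited = new BitSet(vectors.length);
    visited.set(entry);
    int numVisited = 1;
    candidates.add(entry);
    if (accepted.get(entry)) results.add(entry);
    while (!candidates.isEmpty()) {
      int current = candidates.poll();
      // Stop once the best remaining candidate is worse than the worst kept result.
      if (results.size() >= k
          && sqDist(vectors[current], query) > sqDist(vectors[results.peek()], query)) {
        break;
      }
      for (int next : neighbors[current]) {
        if (visited.get(next)) continue;
        if (numVisited >= visitLimit) return null; // budget checked before each new node
        visited.set(next);
        numVisited++;
        candidates.add(next);
        if (accepted.get(next)) {
          results.add(next);
          if (results.size() > k) results.poll();
        }
      }
    }
    List<Integer> top = new ArrayList<>(results);
    top.sort(byDist);
    return top;
  }

  /** Try the graph first with a budget equal to the number of matches, then fall back. */
  static List<Integer> search(float[][] vectors, int[][] neighbors, int entry,
      float[] query, int k, BitSet accepted) {
    int limit = accepted.cardinality();
    if (limit == 0) return new ArrayList<>();
    List<Integer> approx = graphSearch(vectors, neighbors, entry, query, k, accepted, limit);
    return approx != null ? approx : exactSearch(vectors, query, k, accepted);
  }
}
```

Because the graph phase visits at most as many nodes as the filter matches, and the exact phase scores exactly that many, the total work stays within about twice the exact scan.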
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488527#comment-17488527 ] Julie Tibshirani commented on LUCENE-10382: I had some time to try out the dynamic check I mentioned, and it seems to work. I opened a PR here that builds off Joel's change: https://github.com/apache/lucene/pull/656. It's a draft because there are still some big open API questions. Looking forward to hearing your feedback!
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484759#comment-17484759 ] Joel Bernstein commented on LUCENE-10382: I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the BitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know.
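A rough sketch of the brute-force plan described above, written against plain arrays rather than Lucene's `VectorValues`. Everything here is an assumption made for illustration: the `BruteForceKnn` class, the `Similarity` interface, the `Hit` holder, and the choice of dot product as the scoring function. The part that mirrors the comment is the loop: advance only through documents whose filter bit is set, score each vector, and keep the k best in a min-heap.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; not the KnnVectorQuery implementation.
final class BruteForceKnn {
  /** Hypothetical similarity hook; higher scores mean closer vectors. */
  interface Similarity {
    float compare(float[] a, float[] b);
  }

  static final Similarity DOT_PRODUCT = (a, b) -> {
    float sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  };

  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /** Scores only documents whose filter bit is set; returns hits by descending score. */
  static List<Hit> topK(float[][] vectors, BitSet filter, float[] target, int k, Similarity sim) {
    if (k <= 0) return new ArrayList<>();
    // Min-heap on score, so the weakest kept hit sits at the head.
    PriorityQueue<Hit> heap = new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
    for (int doc = filter.nextSetBit(0); doc >= 0; doc = filter.nextSetBit(doc + 1)) {
      float score = sim.compare(vectors[doc], target);
      if (heap.size() < k) {
        heap.add(new Hit(doc, score));
      } else if (score > heap.peek().score) {
        heap.poll();
        heap.add(new Hit(doc, score));
      }
    }
    List<Hit> hits = new ArrayList<>(heap);
    hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
    return hits;
  }
}
```

The cost is one similarity computation per matching document, which is why this path is attractive only when the filter is selective.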
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482113#comment-17482113 ] Julie Tibshirani commented on LUCENE-10382: What do you think about breaking it into two steps? These seem okay to ship on their own.
1. Joel's PR, plus a very simple fallback strategy. In the query we could check if the bit set would exclude more than 85% of documents, and if so, use an exact scan instead. Based on my experiments with random filters, 85% is conservative, and we're unlikely to see a bad degradation at that point. In the worst case, we do an exact scan when we didn't need to and check 15% of documents. We could document caveats like Mike mentions.
2. Switch from a static check to a more robust one (maybe adaptive). I have some ideas here I'm excited to try out :)
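The static check proposed in step 1 amounts to a one-line comparison. The sketch below is a hypothetical helper (`StaticFallbackCheck` is not a Lucene class), and the 85% constant is simply the figure quoted in the comment, not a value taken from any released code.

```java
import java.util.BitSet;

// Hypothetical helper illustrating the proposed static cutoff.
final class StaticFallbackCheck {
  // Cutoff quoted in the comment above; treated here as a tunable assumption.
  static final double MAX_EXCLUDED_FRACTION = 0.85;

  /** Returns true when the filter excludes so many documents that an exact scan is preferred. */
  static boolean useExactScan(BitSet filter, int numDocs) {
    if (numDocs == 0) return true; // nothing to search, so the exact path is trivially fine
    double excluded = 1.0 - (double) filter.cardinality() / numDocs;
    return excluded > MAX_EXCLUDED_FRACTION;
  }
}
```

With 100 documents, a filter matching 10 of them excludes 90% and selects the exact scan, while a filter matching 50 stays on the graph search.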
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482017#comment-17482017 ] Joel Bernstein commented on LUCENE-10382: I'll keep working on the patch and add some tests that address the suggested changes. We can add the execution plan logic as it becomes more clear.
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481836#comment-17481836 ] Michael Sokolov commented on LUCENE-10382:
> +1 on figuring out a better execution path before the release, it's going to look bad if setting a filter could make the query perform many times slower than a linear scan
This is fair, but by "release" do we mean commit to lucene/main? I think so, because a 9.1 or later release could be cut from that at any time. So ... let's do the work on a feature branch to enable iterating to get it nailed down.
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481636#comment-17481636 ] Adrien Grand commented on LUCENE-10382: +1 on figuring out a better execution path before the release, it's going to look bad if setting a filter could make the query perform many times slower than a linear scan. I like the adaptive idea that would bound the overall cost at 2x the cost of a linear scan.
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481465#comment-17481465 ] Julie Tibshirani commented on LUCENE-10382:
[~sokolov] your multi-step approach makes sense to me. Maybe we could move forward with [~jbernste]'s PR as a way to get the query API down. I think we should introduce the fallback before releasing the change though. Otherwise KnnVectorQuery could easily degrade to brute force performance (or worse!), which could really catch users by surprise.
I've been pondering how to automatically choose when to fall back. It indeed seems best to compare the cost of HNSW vs. an exact scan, instead of choosing an arbitrary cut-off like "filter matches less than 10% of docs". Since the cost trade-off depends on data size and other factors, it doesn't seem possible to find a constant cutoff that works well across situations. To make it practical we might need to expose it as a parameter, which is not very user-friendly. So thinking about [~jpountz]'s suggestion for a cost model...
> I wonder if we could develop a cost model of both approaches assuming a random distribution of deletions, filter matches and vector values, so that the query could compute the cost in both cases for specific values of k, Scorer#cost and LeafReader#numDeletedDocs (and maybe some HNSW-specific parameters like beamWidth?), and pick the approach that has the lesser cost.
I played around with this, looking at HNSW latencies for various filter selectivities. Latency roughly scales with log(n) * 1/p, where n is the number of documents and p is the filter selectivity (the fraction of documents that match the filter). (This sort of makes sense: HNSW is roughly logarithmic, but it needs to search more and more of the graph as nodes are filtered away.) But to compare to brute force, we need a pretty good handle on the scaling constants. These are really hard to pin down: they depend on the HNSW build parameters, the search parameter k, even properties of the dataset.
Looking at the HNSW paper, it gives big-O complexity (under some theoretical assumptions) and doesn't show the impact of beamWidth or maxConn.
Instead of developing a static cost model, I wonder if we could try to make the choice dynamically:
* If the filter is extremely selective (say it matches fewer than k docs), perform a brute-force scan.
* Otherwise we always begin an HNSW search, even when the filter is pretty selective. If the graph search has already visited more than a fraction p of the total documents (or some fraction of that), we stop the search and switch to a brute force scan. This bounds the overall cost at 2x the exact scan.
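The reasoning in this comment can be written out as a short worked derivation. This is a sketch under the comment's own rough scaling observation; the constant c is an assumption introduced here to stand for the unknown scaling factors (build parameters, k, dataset properties), and the symbols n and p follow the comment's usage.

```latex
% Assumed notation: n = documents in the segment, p = fraction matching the filter,
% c = an unknown constant bundling build/search parameters and dataset effects.
\begin{align*}
C_{\text{HNSW}}(n, p) &\approx c \,\frac{\log n}{p}, &
C_{\text{exact}}(n, p) &= p\,n \\
C_{\text{HNSW}} = C_{\text{exact}} &\iff p^{*} = \sqrt{\frac{c \log n}{n}} \\
C_{\text{dynamic}} &\le \underbrace{p\,n}_{\text{aborted graph search}}
  + \underbrace{p\,n}_{\text{exact scan}} = 2\,p\,n
\end{align*}
```

The break-even selectivity p* depends on both n and c, which is why no constant cutoff works across index sizes. The dynamic rule avoids estimating c altogether: the graph search is abandoned after visiting pn documents, so the worst case is one aborted search plus one exact scan.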
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479810#comment-17479810 ] Joel Bernstein commented on LUCENE-10382: I put up a PR with a Query as the filter. No tests yet, just a first look at a possible impl.
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479649#comment-17479649 ] Michael Sokolov commented on LUCENE-10382:
> I'm a little fuzzy on the cost computation being discussed. Is this about the decision to do the ANN or fully materialized KNN?
Yes. I wouldn't worry about that at first though. Maybe we can do three steps something like this:
1. implement Query-based filter, always using HNSW search that we have today. It would have to be marked with some serious caveats about potential performance risk, but we should make progress somehow without insisting on the full implementation at once. Perhaps we can just document the risk, mark as experimental in javadoc?
2. implement full KNN fallback with a fixed cutoff (based on Query cost?)
3. implement an adaptive cost computation
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479613#comment-17479613 ] Joel Bernstein commented on LUCENE-10382: I think Query makes sense as well. I'm a little fuzzy on the cost computation being discussed. Is this about the decision to do the ANN or fully materialized KNN? Or is there another cost being discussed that deals with the query being passed in as a filter?
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479487#comment-17479487 ] Michael Sokolov commented on LUCENE-10382: If we go with a {{Query}}-based filter, I guess it would still be possible to create a query wrapping a {{BitSetProducer}} (like TPBJQ), so it's not as if it's a hard decision preventing a customer providing a precomputed bitset - I think? Also, relying on the cache makes sense to me, but I have some reservations. One issue we've found is that because it caches entire Query results, it can often miss significant caching opportunities, say when a complex {{BooleanQuery}} has a subset of clauses that can profitably be cached. Maybe the Query-writer can structure their queries to be more cache-friendly by nesting BQs? But then again they get rewritten and may be flattened prior to the cache seeing them. Anyway maybe we can enhance the cache, but this is a separate issue; +1 to move ahead using Query
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479227#comment-17479227 ] Adrien Grand commented on LUCENE-10382: --- We have queries like ParentChildrenBlockJoinQuery that take a {{BitSetProducer}} so that the query doesn't have to care about producing bit sets from queries, it's not its responsibility. In this case though, I think the decision should happen in Lucene, since it hopefully has more data to make the right decision than users have (e.g. Scorer#cost). If users knew in advance what filters they would like to apply, then they should split their indexes based on these filters instead of passing filters to Lucene. I wonder if we could develop a cost model of both approaches assuming a random distribution of deletions, filter matches and vector values, so that the query could compute the cost in both cases for specific values of {{{}k{}}}, {{Scorer#cost}} and {{LeafReader#numDeletedDocs}} (and maybe some HNSW-specific parameters like beamWidth?), and pick the approach that has the lesser cost. LRUQueryCache already happens to cache dense filters (cost > maxdoc / 100) as bit sets, which helps with conjunctions for instance, so maybe we would be able to reuse it as a way to avoid recomputing bitsets over and over again for popular filters. > Allow KnnVectorQuery to operate over a subset of liveDocs > - > > Key: LUCENE-10382 > URL: https://issues.apache.org/jira/browse/LUCENE-10382 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 9.0 >Reporter: Joel Bernstein >Priority: Major > > Currently the KnnVectorQuery selects the top K vectors from all live docs. > This ticket will change the interface to make it possible for the top K > vectors to be selected from a subset of the live docs. 
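The cost-model idea in the comment above could take roughly the following shape. This is only a sketch under stated assumptions: the class name, method names, and above all the formula in `approximateCost` are hypothetical placeholders invented for illustration, not anything proposed in the thread or implemented in Lucene. The only inputs taken from the comment are `k`, the filter cost (as reported by `Scorer#cost`), the segment's `maxDoc`, and the deleted-doc count.

```java
// Hypothetical sketch: none of these names or formulas come from Lucene.
final class KnnStrategyChooser {

  enum Strategy { EXACT_OVER_FILTER, APPROXIMATE_WITH_FILTER }

  /** Exact search computes one vector distance per document matched by the filter. */
  static double exactCost(long filterCost) {
    return filterCost;
  }

  /**
   * Placeholder estimate of the nodes a graph search visits. ASSUMPTION: with
   * uniformly random filter matches, about k / selectivity candidates must be
   * examined to find k accepted ones, times a log factor for graph descent.
   * The estimate is capped at the number of live documents.
   */
  static double approximateCost(int k, long filterCost, int maxDoc, int numDeletedDocs) {
    int live = Math.max(1, maxDoc - numDeletedDocs);
    double selectivity = Math.min(1.0, Math.max(1.0, (double) filterCost) / live);
    double logFactor = Math.log(live) / Math.log(2) + 1;
    return Math.min(live, (k / selectivity) * logFactor);
  }

  /** Picks whichever approach has the lower estimated cost. */
  static Strategy choose(int k, long filterCost, int maxDoc, int numDeletedDocs) {
    double exact = exactCost(filterCost);
    double approximate = approximateCost(k, filterCost, maxDoc, numDeletedDocs);
    return exact <= approximate ? Strategy.EXACT_OVER_FILTER : Strategy.APPROXIMATE_WITH_FILTER;
  }

  public static void main(String[] args) {
    // A very selective filter and a dense filter over the same segment.
    System.out.println(choose(10, 100L, 1_000_000, 0));
    System.out.println(choose(10, 500_000L, 1_000_000, 0));
  }
}
```

Under these made-up estimates, a filter matching only a handful of documents favors the exact scan, while a dense filter favors the graph search. A real model would have to be fitted against measurements.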
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479017#comment-17479017 ] Michael Sokolov commented on LUCENE-10382: --
> I thought about passing in Bits but I don't think it will work because searchLeaf is at the segment level so we'd need segment level Bits.
Oh, good point!
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479016#comment-17479016 ] Joel Bernstein commented on LUCENE-10382: - I thought about passing in {{Bits}}, but I don't think it will work because {{searchLeaf}} is at the segment level, so we'd need segment-level {{Bits}}. We could pass in a {{Query}} and collect the segment-level bits. That would be easy to implement as well: {{KnnVectorQuery(String field, float[] target, int k, Query query)}}
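The segment-level point can be illustrated without Lucene. A per-segment search addresses documents by ids local to that segment, so an index-wide bit set cannot be handed to it as is. The sketch below re-bases one global bit set into per-segment bit sets using each segment's starting offset (the role `LeafReaderContext#docBase` plays in Lucene). The class and method names are hypothetical, and `java.util.BitSet` stands in for Lucene's `Bits`. Collecting matches from a `Query` per segment, as suggested above, would yield the per-segment sets directly without this splitting step.

```java
import java.util.BitSet;

// Hypothetical illustration; java.util.BitSet stands in for Lucene's Bits.
final class SegmentBits {

  /**
   * Splits an index-wide set of matching doc ids into one bit set per segment.
   * docBases[i] is the first global doc id of segment i (ascending order) and
   * maxDoc is the total number of docs in the index.
   */
  static BitSet[] split(BitSet globalMatches, int[] docBases, int maxDoc) {
    BitSet[] perSegment = new BitSet[docBases.length];
    for (int seg = 0; seg < docBases.length; seg++) {
      int start = docBases[seg];
      int end = seg + 1 < docBases.length ? docBases[seg + 1] : maxDoc;
      BitSet local = new BitSet(end - start);
      for (int doc = globalMatches.nextSetBit(start);
          doc >= 0 && doc < end;
          doc = globalMatches.nextSetBit(doc + 1)) {
        local.set(doc - start); // convert global id to segment-local id
      }
      perSegment[seg] = local;
    }
    return perSegment;
  }

  public static void main(String[] args) {
    BitSet global = new BitSet();
    global.set(0);
    global.set(3);
    global.set(4);
    global.set(7);
    // Three segments starting at global ids 0, 3 and 5 in an index of 8 docs.
    for (BitSet local : split(global, new int[] {0, 3, 5}, 8)) {
      System.out.println(local);
    }
  }
}
```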
[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs
[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479014#comment-17479014 ] Michael Sokolov commented on LUCENE-10382: -- How would this look? An easy first step is to add a filter parameter to KnnVectorQuery, {{public KnnVectorQuery(String field, float[] target, int k, Bits filter)}}; then it can call {{LeafReader.searchNearestVectors}} with {{liveDocs.intersect(filter)}} instead of {{liveDocs}}. [~julietibs] shared on the list a link to a paper showing how the search degenerates for highly selective filters. The authors' approach was to fall back to "brute force" KNN when selectivity passes a fixed threshold. We could do that too, and it makes sense to me, but I guess the question is: where should this fallback happen in the API? The implementation of full (non-approximate) KNN (with a filter) only needs the {{VectorValues}} iterator, which the {{KnnVectorsReader}} already provides. It could be implemented as part of {{KnnVectorQuery}}. Is there a better place?
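The "brute force" fallback described above amounts to scoring every accepted document and keeping the best k. Below is a minimal sketch in plain Java, assuming squared Euclidean distance and an in-memory array of vectors. The class `FilteredExactKnn` and its methods are hypothetical names, and `java.util.BitSet` stands in for the intersection of `liveDocs` and the filter. In Lucene the vectors would come from the `VectorValues` iterator rather than an array.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.PriorityQueue;

// Hypothetical illustration of exact (non-approximate) KNN over a filtered doc set.
final class FilteredExactKnn {

  /** A candidate document and its distance to the target vector. */
  private static final class Hit {
    final int doc;
    final float distance;

    Hit(int doc, float distance) {
      this.doc = doc;
      this.distance = distance;
    }
  }

  /** Squared Euclidean distance between two vectors of equal length. */
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * Returns the ids of the k accepted docs nearest to target, nearest first.
   * accepted plays the role of liveDocs intersected with the filter.
   */
  static int[] search(float[][] vectors, float[] target, int k, BitSet accepted) {
    if (k <= 0) {
      return new int[0];
    }
    // Max-heap on distance: the head is the worst hit currently kept.
    PriorityQueue<Hit> heap =
        new PriorityQueue<>((x, y) -> Float.compare(y.distance, x.distance));
    for (int doc = accepted.nextSetBit(0);
        doc >= 0 && doc < vectors.length;
        doc = accepted.nextSetBit(doc + 1)) {
      float distance = squaredDistance(vectors[doc], target);
      if (heap.size() < k) {
        heap.add(new Hit(doc, distance));
      } else if (distance < heap.peek().distance) {
        heap.poll(); // evict the current worst hit
        heap.add(new Hit(doc, distance));
      }
    }
    // Drain worst-first into the array from the back so the nearest ends up first.
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = heap.poll().doc;
    }
    return result;
  }

  public static void main(String[] args) {
    float[][] vectors = {{0f, 0f}, {1f, 1f}, {2f, 2f}, {3f, 3f}};
    BitSet accepted = new BitSet();
    accepted.set(1);
    accepted.set(3);
    System.out.println(Arrays.toString(search(vectors, new float[] {0f, 0f}, 2, accepted)));
  }
}
```

The cost of this scan is one distance computation per accepted document, which is why it only pays off when the filter is highly selective.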