[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497373#comment-17497373
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit d952b3a58114ce5a929211bca7a9b0e822658f35 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d952b3a ]

LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702)

`KnnVectorQuery` currently uses the index reader's hashcode to make sure that
the query it builds runs on the right reader. We had added
`IndexContextReader#id` a while back for a similar purpose with `TermStates`,
let's reuse it?

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497369#comment-17497369
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit d47ff38d703c6b5da1ef9c774ccda201fd682b8d in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d47ff38 ]

LUCENE-10382: Use `IndexReaderContext#id` to check reader identity. (#702)

`KnnVectorQuery` currently uses the index reader's hashcode to make sure that
the query it builds runs on the right reader. We had added
`IndexContextReader#id` a while back for a similar purpose with `TermStates`,
let's reuse it?

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497161#comment-17497161
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit a3b136573fcb2a1e61dd70519708a5ef36d20eb8 in lucene's branch 
refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a3b1365 ]

LUCENE-10382: Fix testSearchWithVisitedLimit failures


> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497159#comment-17497159
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit d9c2e46824c8b5be8f471da6ce291e908cc58955 in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d9c2e46 ]

LUCENE-10382: Fix testSearchWithVisitedLimit failures


> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497066#comment-17497066
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit 29d4adfe60368c0159cd0accd53efba77ca11771 in lucene's branch 
refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=29d4adf ]

LUCENE-10382: Ensure kNN filtering works with other codecs (#700)

The original PR that added kNN filtering support overlooked non-default codecs.
This follow-up ensures that other codecs work with the new filtering logic:
* Make sure to check the visited nodes limit in `SimpleTextKnnVectorsReader`
and `Lucene90HnswVectorsReader`
* Add a test `BaseKnnVectorsFormatTestCase` to cover this case
* Fix failures in `TestKnnVectorQuery#testRandomWithFilter`, whose assumptions
don't hold when SimpleText is used

This PR also clarifies the limit checking logic for
`Lucene91HnswVectorsReader`. Now we always check the limit before visiting a
new node, whereas before we only checked it in an outer loop.

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497061#comment-17497061
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit b40a750aa8c0cc05291d8d8673d9d068d078d2de in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b40a750 ]

LUCENE-10382: Ensure kNN filtering works with other codecs (#700)

The original PR that added kNN filtering support overlooked non-default codecs.
This follow-up ensures that other codecs work with the new filtering logic:
* Make sure to check the visited nodes limit in `SimpleTextKnnVectorsReader`
and `Lucene90HnswVectorsReader`
* Add a test `BaseKnnVectorsFormatTestCase` to cover this case
* Fix failures in `TestKnnVectorQuery#testRandomWithFilter`, whose assumptions
don't hold when SimpleText is used

This PR also clarifies the limit checking logic for
`Lucene91HnswVectorsReader`. Now we always check the limit before visiting a
new node, whereas before we only checked it in an outer loop.

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494283#comment-17494283
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit af40b448227e07e93d12c62f9dcf083b92f6eb51 in lucene's branch 
refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=af40b44 ]

LUCENE-10382: Support filtering in KnnVectorQuery (#656)

This PR adds support for a query filter in KnnVectorQuery. First, we gather the
query results for each leaf as a bit set. Then the HNSW search skips over the
non-matching documents (using the same approach as for live docs). To prevent
HNSW search from visiting too many documents when the filter is very selective,
we short-circuit if HNSW has already visited more than the number of documents
that match the filter, and execute an exact search instead. This bounds the
number of visited documents at roughly 2x the cost of just running the exact
filter, while in most cases HNSW completes successfully and does a lot better.

Co-authored-by: Joel Bernstein 


> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494181#comment-17494181
 ] 

ASF subversion and git services commented on LUCENE-10382:
--

Commit 8ca372573dba0f4755b982b0c36a2b87aaf4705b in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8ca3725 ]

LUCENE-10382: Support filtering in KnnVectorQuery (#656)

This PR adds support for a query filter in KnnVectorQuery. First, we gather the
query results for each leaf as a bit set. Then the HNSW search skips over the
non-matching documents (using the same approach as for live docs). To prevent
HNSW search from visiting too many documents when the filter is very selective,
we short-circuit if HNSW has already visited more than the number of documents
that match the filter, and execute an exact search instead. This bounds the
number of visited documents at roughly 2x the cost of just running the exact
filter, while in most cases HNSW completes successfully and does a lot better.

Co-authored-by: Joel Bernstein 

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-02-07 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488527#comment-17488527
 ] 

Julie Tibshirani commented on LUCENE-10382:
---

I had some time to try out the dynamic check I mentioned, and it seems to work. 
I opened a PR here that builds off Joel's change: 
https://github.com/apache/lucene/pull/656. It's a draft because there are still 
some big open API questions. Looking forward to hearing your feedback!

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-31 Thread Joel Bernstein (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484759#comment-17484759
 ] 

Joel Bernstein commented on LUCENE-10382:
-

I'm going to start the brute force implementation inside of KnnVectorQuery 
soon. My plan is to advance through the VectorValues using the BitsFilter and 
score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is 
code available that does this already let me know.

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-25 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482113#comment-17482113
 ] 

Julie Tibshirani commented on LUCENE-10382:
---

What do you think about breaking it into two steps? These seem okay to ship on 
their own.
1. Joel's PR, plus a very simple fallback strategy. In the query we could check 
if the bit set would exclude more than 85% of documents, and if so, use an 
exact scan instead. Based on my experiments with random filters, 85% is 
conservative, and we're unlikely to see a bad degradation at that point. In the 
worst case, we do an exact scan when we didn't need to and check 15% of 
documents. We could document caveats like Mike mentions.
2. Switch from a static check to a more robust one (maybe adaptive). I have 
some ideas here I'm excited to try out :) 

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-25 Thread Joel Bernstein (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482017#comment-17482017
 ] 

Joel Bernstein commented on LUCENE-10382:
-

I'll keep working on the patch and add some tests and that address the 
suggested changes. We can add the execution plan logic as it becomes more clear.



> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-25 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481836#comment-17481836
 ] 

Michael Sokolov commented on LUCENE-10382:
--

> +1 on figuring out a better execution path before the release, it's going to 
> look bad if setting a filter could make the query perform many times slower 
> than a linear scan

This is fair, but by "release" do we mean – commit to lucene/main? I think so, 
because a 9.1 or later release could be cut from that at any time. So ... let's 
do the work on a feature branch to enable iterating to get it nailed down.

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-25 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481636#comment-17481636
 ] 

Adrien Grand commented on LUCENE-10382:
---

+1 on figuring out a better execution path before the release, it's going to 
look bad if setting a filter could make the query perform many times slower 
than a linear scan.

I like the adaptive idea that would bound the overall cost at 2x the cost of a 
linear scan.

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-24 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481465#comment-17481465
 ] 

Julie Tibshirani commented on LUCENE-10382:
---

[~sokolov] your multi-step approach makes sense to me. Maybe we could move 
forward with [~jbernste]'s PR as a way to get the query API down. I think we 
should introduce the fallback before releasing the change though. Otherwise 
KnnVectorQuery could easily degrade to brute force performance (or worse!), 
which could really catch users by surprise.

I've been pondering how to automatically choose when to fall back. It indeed 
seems best to compare the cost of HNSW vs. an exact scan, instead of choosing 
an arbitrary cut-off like "filter matches less than 10% of docs". Since the 
cost trade-off depends on data size and other factors, it doesn't seem possible 
to find a constant cutoff that works well across situations. To make it 
practical we might need to expose it as a parameter, which is not very 
user-friendly. So thinking about [~jpountz]'s suggestion for a cost model...

bq.  I wonder if we could develop a cost model of both approaches assuming a 
random distribution of deletions, filter matches and vector values, so that the 
query could compute the cost in both cases for specific values of k, 
Scorer#cost and LeafReader#numDeletedDocs (and maybe some HNSW-specific 
parameters like beamWidth?), and pick the approach that has the lesser cost.

I played around with this, looking at HNSW latencies for various filter 
selectivities. It roughly scales with _log (n) * 1/p_, where _p_ is the filter 
selectivity. (This sort of makes sense: HNSW is roughly logarithmic, but it 
needs to search more and more of the graph as nodes are filtered away.) But to 
compare to brute force, we need a pretty good handle on the scaling constants. 
These are really hard to pin down -- they depend on the HNSW build parameters, 
the search parameter k, even properties of the dataset. Looking at the HNSW 
paper, it gives big-O complexity (under some theoretical assumptions) and 
doesn't show the impact of beamWidth or maxConn.

Instead of developing a static cost model, I wonder if we could try to make the 
choice dynamically:
* If the filter is extremely selective (say it matches fewer than k docs), 
perform a brute-force scan.
* Otherwise we always begin an HNSW search, even when the filter is pretty 
selective. If the graph search has already visited more than _p_ of the total 
documents (or some fraction of that), we stop the search and switch to a brute 
force scan. This bounds the overall cost at 2x the exact scan.

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-20 Thread Joel Bernstein (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479810#comment-17479810
 ] 

Joel Bernstein commented on LUCENE-10382:
-

I put up a PR with a Query as the filter. No tests yet, just a first look at a 
possible impl.

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-20 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479649#comment-17479649
 ] 

Michael Sokolov commented on LUCENE-10382:
--

> I'm a little fuzzy on the cost computation being discussed. Is this about the 
> decision to do the ANN or fully materialized KNN?

Yes. I wouldn't worry about that at first though. Maybe we can do three steps 
something like this:
 # implement Query-based filter, always using HNSW search that we have today. 
It would have to be marked with some serious caveats about potential 
performance risk, but we should make progress somehow without insisting on the 
full implementation at once. Perhaps we can just document the risk, mark as 
experimental in javadoc?
 # implement full KNN fallback with a fixed cutoff (based on Query cost?)
 # implement an adaptive cost computation

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-20 Thread Joel Bernstein (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479613#comment-17479613
 ] 

Joel Bernstein commented on LUCENE-10382:
-

I think Query makes sense as well. I'm a little fuzzy on the cost computation 
being discussed. Is this about the decision to do the ANN or fully materialized 
KNN? Or is there another cost being discussed that deals with the query being 
passed in as a filter?


> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-20 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479487#comment-17479487
 ] 

Michael Sokolov commented on LUCENE-10382:
--

If we go with a {{{}Query{}}}-based filter, I guess it would still be possible 
to create a query wrapping a {{BitSetProducer}} (like TPBJQ), so it's not as if 
it's a hard decision preventing a customer providing a precomputed bitset - I 
think?

Also, relying on the cache makes sense to me, but I have some reservations. One 
issue we've found is that because it caches entire Query results, it can often 
miss significant caching opportunities, say when a complex {{BooleanQuery}} has 
a subset of clauses that can profitably be cached. Maybe the Query-writer can 
structure their queries to be more cache-friendly by nesting BQs? But then 
again they get rewritten and may be flattened prior to the cache seeing them.

Anyway maybe we can enhance the cache, but this is a separate issue;  +1 to 
move ahead using Query

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-20 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479227#comment-17479227
 ] 

Adrien Grand commented on LUCENE-10382:
---

We have queries like ParentChildrenBlockJoinQuery that take a 
{{BitSetProducer}} so that the query doesn't have to care about producing bit 
sets from queries, it's not its responsibility. In this case though, I think 
the decision should happen in Lucene, since it hopefully has more data to make 
the right decision than users have (e.g. Scorer#cost). If users knew in advance 
what filters they would like to apply, then they should split their indexes 
based on these filters instead of passing filters to Lucene.

I wonder if we could develop a cost model of both approaches assuming a random 
distribution of deletions, filter matches and vector values, so that the query 
could compute the cost in both cases for specific values of {{{}k{}}}, 
{{Scorer#cost}} and {{LeafReader#numDeletedDocs}} (and maybe some HNSW-specific 
parameters like beamWidth?), and pick the approach that has the lesser cost.

LRUQueryCache already happens to cache dense filters (cost > maxdoc / 100) as 
bit sets, which helps with conjunctions for instance, so maybe we would be able 
to reuse it as a way to avoid recomputing bitsets over and over again for 
popular filters.

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479017#comment-17479017
 ] 

Michael Sokolov commented on LUCENE-10382:
--

> I thought about passing in Bits but I don't think it will work because 
> searchLeaf is at the segment level so we'd need segment level Bits.

Oh, good point!

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-19 Thread Joel Bernstein (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479016#comment-17479016
 ] 

Joel Bernstein commented on LUCENE-10382:
-

I thought about passing in Bits but I don't think it will work because 
searchLeaf is at the segment level so we'd need segment level Bits.

We could pass in a Query and collect the bits segment level bits. Would be easy 
to implement as well. 

KnnVectorQuery(String field, float[] target, int k, Query query)

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-01-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479014#comment-17479014
 ] 

Michael Sokolov commented on LUCENE-10382:
--

How would this look? An easy first step is to add a filter parameter to 
KnnVectorQuery 

{{  public KnnVectorQuery(String field, float[] target, int k, Bits filter)}}

then it can call {{LeafReader.searchNearestVectors}} with 
{{liveDocs.intersect(filter)}} instead of {{liveDocs.}}

[~julietibs] shared on list a link to a paper showing how the search 
degenerates for highly selective filters. The writers' approach was to fall 
back to "brute force" KNN when selectivity passes a fixed threshold. We could 
do that too, and it makes sense to me, but I guess the question is: where 
should this fallback happen in the API?

The implementation of full (non-approximate) KNN (with a filter) only needs the 
VectorValues iterator which the KnnVectorsReader already provides. It could be 
implemented as part of KnnVectorQuery. Is there a better place?

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org