[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-15 Thread GitBox


benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1049904819


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
* @throws IllegalArgumentException if k is less than 1
*/
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according 
to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   Tl;dr
   
   Thank you for bearing with me! I think this is a good change.
   
   I would be happy with the JavaDocs, etc. clearly indicating that this 
threshold relates to the un-boosted vector score, not the raw similarity 
calculation. Dot-product, cosine, and euclidean are well defined concepts 
outside of Lucene. Lucene mangles (for undoubtably good reasons) the output of 
these similarities in undocumented ways to fit within boundaries.
   
   > with the current CR,
   
   I don't know what `CR` means. Change request?
   
   > However, this is similar to how result scores are treated elsewhere in 
Lucene - their value ranges are not well-defined;
   
   Agreed, ranges are usually predicated on term statistics, etc. and can 
potentially be considered "unbounded" as the corpus changes. 
   
   However, does Lucene require that all unboosted BM25 scores are between 0-1? 
It does seem like an "arbitrary" decision (to me, I don't know the full-breadth 
of Lucene optimizations, etc. when it comes to scores) to restrict vector 
similarity in this way. But that is a broader conversation. I have some 
learning to do.
   
   >  I guess practically speaking, as a user, I think I am going to have to do 
empirical work to know what threshold to use; these are not likely going to be 
motivated by some a priori knowledge of what a "good" dot-product is
   
   I would argue that a user could have a priori knowledge here. Think of it in 
the use case when the user knows their model used to make the vectors. At that 
point, they 100% know what is considered relevant based on their loss function 
and training + test data. Choosing a dot-product or cosine threshold that fits 
within 90% percentile or something given their test data results.
   
   I agree that this would be different if users were using an "off the shelf" 
model. In that case, they would probably require hybrid-search and combining 
with BM25 to get anything like relevant results (boosting various queries 
accordingly). Thus, learning what settings are required in an unfiltered case.
   
   > if we were to switch to using vector similarities that would correspond 
more directly to the underlying functions, we would have to clearly define them
   
   Cosine, dot-product, euclidean, are all already well defined. The functions 
to calculate them are universally recognized. Where Lucene separates itself is 
the manipulation of the similarity output to fit into a range [0, 1]. I guess 
this is cost of doing business in Lucene.
   
   I am not suggesting that all scoring of vector document searches changes. 
Simply that "similarity" and "score" are related, but are different things. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-14 Thread GitBox


benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1048856364


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
* @throws IllegalArgumentException if k is less than 1
*/
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according 
to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   @msokolov you haven't missed anything. I am specifically talking about users 
providing `similarityThreshold` to the query. If they have calculating that 
they want a specific `cosine` or `dotProduct` similarity, they would then need 
to adjust that to match Lucene's scoring transformation.
   
   I think that `similarityThreshold` should mean vector similarities. We can 
transform it for the user to reflect the score that similarity represents 
(given vector encoding type and similarity function).
   
   
   An example here is `dotProduct`. The user knows they want `FLOAT32` vectors 
within a dotProduct of 0.7. With this API that ACTUALLY means they want to 
limit the scores to .85 (`(1 + dotProduct)/2`). How is the user supposed to 
know that?
   
   This seems really weird to me.
   
   This doesn't take into account the different scoring methods between vector 
types as well, which can get even more confusing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-14 Thread GitBox


benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1048764283


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
* @throws IllegalArgumentException if k is less than 1
*/
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according 
to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   As it is, callers of this method need to know the inner nuances of how we 
calculate the score given the similarity. I would prefer them not having to 
know that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-14 Thread GitBox


benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1048763428


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
* @throws IllegalArgumentException if k is less than 1
*/
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the k nearest documents to the target vector according 
to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   So, right now, `similarityThreshold` is really `scoreThreshold`. I would 
prefer it to be the actual similarity of the vectors and NOT how we translate 
it to scores (which for ByteVectors has some surprising behavior). 
   
   @msokolov What do you think here? 
   
   Its already tricky that "similarity implies score". But truly, similarity != 
score.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

2022-12-06 Thread GitBox


benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1040969450


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -37,6 +37,7 @@
  * @param  the type of query vector
  */
 public class HnswGraphSearcher {
+  private final int UNBOUNDED_QUEUE_INIT_SIZE = 10_000;

Review Comment:
   Any research to indicate why this number was chosen? It seems silly that if 
a user provides `k = 10_001` it would have a queue bigger than `k = 
Integer.MAX_VALUE`.
   
   Technically, the max value here should be something like 
`ArrayUtil.MAX_ARRAY_LENGTH` But this eagerly allocates a `new 
long[heapSize];`. This is VERY costly.
   
   I would prefer a number with some significant reason behind it or some 
better way of queueing neighbors.



##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##
@@ -235,7 +312,7 @@ private NeighborQueue searchLevel(
 while (candidates.size() > 0 && results.incomplete() == false) {
   // get the best candidate (closest or best scoring)
   float topCandidateSimilarity = candidates.topScore();
-  if (topCandidateSimilarity < minAcceptedSimilarity) {
+  if (topCandidateSimilarity < minAcceptedSimilarity && results.size() >= 
topK) {
 break;
   }

Review Comment:
   I am not sure about this. This stops gathering results once its filled. This 
defeats the purpose of exploring the graph.
   
   Have you seen how this effects recall?



##
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##
@@ -232,8 +232,48 @@ public final PostingsEnum postings(Term term) throws 
IOException {
* @return the k nearest neighbor documents, along with their 
(searchStrategy-specific) scores.
* @lucene.experimental
*/
+  public final TopDocs searchNearestVectors(
+  String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException {
+return searchNearestVectors(
+field, target, k, Float.NEGATIVE_INFINITY, acceptDocs, visitedLimit);
+  }
+
+  /**
+   * Return the k nearest neighbor documents as determined by comparison of 
their vector values for
+   * this field, to the given vector, by the field's similarity function. The 
score of each document
+   * is derived from the vector similarity in a way that ensures scores are 
positive and that a
+   * larger score corresponds to a higher ranking.
+   *
+   * The search is allowed to be approximate, meaning the results are not 
guaranteed to be the
+   * true k closest neighbors. For large values of k (for example when k is 
close to the total
+   * number of documents), the search may also retrieve fewer than k documents.
+   *
+   * The returned {@link TopDocs} will contain a {@link ScoreDoc} for each 
nearest neighbor,
+   * sorted in order of their similarity to the query vector (decreasing 
scores). The {@link
+   * TotalHits} contains the number of documents visited during the search. If 
the search stopped
+   * early because it hit {@code visitedLimit}, it is indicated through the 
relation {@code
+   * TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO}.
+   *
+   * @param field the vector field to search
+   * @param target the vector-valued query
+   * @param k the number of docs to return (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   Would it be possible for this threshold to be an actual distance? My concern 
here is that for things like `byteVectors`, dot-product scores are insanely 
small (I think this is a design flaw in itself) and may be confusing to users 
who want a given "radius" but instead have to figure out a score related to 
their radius. 
   
   It would be prudent that IF we provided some filtering on a threshold within 
the search, that this threshold reflects vector distance directly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org