[GitHub] [lucene] msokolov commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

GitBox Tue, 31 May 2022 06:06:54 -0700


msokolov commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r885611627



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##########
@@ -109,12 +117,15 @@ public int[] nodes() {
 
   /** Returns the top element's node id. */
   public int topNode() {
-    return (int) order.apply(heap.top());
+    return decodeNodeId(heap.top());
   }
 
-  /** Returns the top element's node score. */
-  public float topScore() {
-    return NumericUtils.sortableIntToFloat((int) (order.apply(heap.top()) >> 
32));
+  /**
+   * Returns the top element's node score. For the min heap this is the 
minimum score. For the max

Review Comment:
   To me "top" means "top of the heap" - I think it's kind of standard, and not 
the same as "max"



##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswGraphBuilder.java:
##########
@@ -51,7 +50,7 @@ public final class Lucene90HnswGraphBuilder {
   private final VectorSimilarityFunction similarityFunction;
   private final RandomAccessVectorValues vectorValues;
   private final SplittableRandom random;
-  private final BoundsChecker bound;
+  private float minAcceptedSimilarity = Float.POSITIVE_INFINITY;

Review Comment:
   yes, we should not be modifying anything in backward-codecs. On the whole 
subject of back-compat here: my preference is to minimize renaming public names 
(methods, classes, interfaces, etc) unless there is a pretty good reason. 
Otherwise it creates noise for users upgrading, and as we know, the name that 
sames appropriate today may seem weird tomorrow. In the case of deleting this 
BoundsChecker class it makes sense to me to do that since it's a real 
simplification.
   
   In contrast, I'd rather not be renaming public methods like 
`NeighborQueue.topScore` to `NeighborQueue.topNodeScore` just for the sake of 
perceived improvement in clarity because it forces consumers to adapt when 
upgrading.



##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -47,63 +42,30 @@ public float convertToScore(float similarity) {
   DOT_PRODUCT {
     @Override
     public float compare(float[] v1, float[] v2) {
-      return dotProduct(v1, v2);
-    }
-
-    @Override
-    public float convertToScore(float similarity) {
-      return (1 + similarity) / 2;
+      return (1 + dotProduct(v1, v2)) / 2;
     }
   },
 
   /**
    * Cosine similarity. NOTE: the preferred way to perform cosine similarity 
is to normalize all
    * vectors to unit length, and instead use {@link 
VectorSimilarityFunction#DOT_PRODUCT}. You
    * should only use this function if you need to preserve the original 
vectors and cannot normalize
-   * them in advance.
+   * them in advance. The similarity score is normalised to assure it is 
positive.
    */
   COSINE {
     @Override
     public float compare(float[] v1, float[] v2) {
-      return cosine(v1, v2);
-    }
-
-    @Override
-    public float convertToScore(float similarity) {
-      return (1 + similarity) / 2;
+      return (1 + cosine(v1, v2)) / 2;
     }
   };
 
   /**
-   * If true, the scores associated with vector comparisons are nonnegative 
and in reverse order;
-   * that is, lower scores represent more similar vectors. Otherwise, if 
false, higher scores
-   * represent more similar vectors, and scores may be negative or positive.
-   */
-  public final boolean reversed;
-
-  VectorSimilarityFunction(boolean reversed) {
-    this.reversed = reversed;
-  }
-
-  VectorSimilarityFunction() {
-    reversed = false;
-  }
-
-  /**
-   * Calculates a similarity score between the two vectors with a specified 
function.
+   * Calculates a similarity score between the two vectors with a specified 
function. Higher
+   * similarity scores correspond to closer vectors.
    *
    * @param v1 a vector
    * @param v2 another vector, of the same dimension
    * @return the value of the similarity function applied to the two vectors
    */
   public abstract float compare(float[] v1, float[] v2);

Review Comment:
   Let's please not rename the public interface methods unless there is some 
very good reason. As for the private variables, I don't care as much - we're 
free to rename for whatever the current perceived notion of clarity is, but I 
think we should be more careful about renaming public stuff. Or let's have a 
separate issue where we can debate that, whether if we do rename we should do 
it only on 10.x/main, and then just keep this PR about the already somewhat 
controversial idea of moving the scalar division / normalization into this hot 
spot.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [lucene] msokolov commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

Reply via email to