uschindler commented on code in PR #13076:
URL: https://github.com/apache/lucene/pull/13076#discussion_r1479835140
##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -94,6 +95,29 @@ public float compare(float[] v1, float[] v2) {
public float compare(byte[] v1, byte[] v2) {
return scaleMaxInnerProductScore(dotProduct(v1, v2));
}
+ },
+ /**
+ * Binary Hamming distance; Computes how many bits are different in two bytes.
+ *
+ * <p>Only supported for bytes. To convert the distance to a similarity score we normalize using 1
+ * / (1 + hammingDistance)
+ */
+ BINARY_HAMMING_DISTANCE {
+ @Override
+ public float compare(float[] v1, float[] v2) {
+ throw new UnsupportedOperationException(
+ BINARY_HAMMING_DISTANCE.name() + " is only supported for byte vectors");
+ }
+
+ @Override
+ public float compare(byte[] v1, byte[] v2) {
+ return (1f / (1 + binaryHammingDistance(v1, v2)));
Review Comment:
This depends on the vector length; is that intended? I would have expected
something like `dimensions * 8 / (1 + distance)`. I know it is not relevant
for scoring purposes, since it is only a constant factor, but we apply some
normalization in other functions, too.
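
To make the two formulas concrete, here is a minimal sketch that computes both
the score from the diff, `1 / (1 + distance)`, and the scaled variant suggested
above. The class and method names are hypothetical and not part of the PR, and
the byte-by-byte distance loop is a simple reference version, not the
implementation under review.

```java
public class HammingScoreSketch {

  // Reference bit-level Hamming distance, one byte at a time.
  // Hypothetical helper for illustration; not the PR's implementation.
  public static int hammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits because the XOR result is sign-extended to int.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  // Score as written in the diff: 1 / (1 + distance).
  public static float score(byte[] a, byte[] b) {
    return 1f / (1 + hammingDistance(a, b));
  }

  // Variant from the review comment: dimensions * 8 / (1 + distance),
  // where "dimensions" is taken to be the number of bytes (an assumption).
  public static float scaledScore(byte[] a, byte[] b) {
    return (a.length * 8f) / (1 + hammingDistance(a, b));
  }

  public static void main(String[] args) {
    byte[] query = {0x0F, 0x00};
    byte[] near = {0x0E, 0x00}; // differs from query in 1 bit
    byte[] far = {(byte) 0xF0, 0x00}; // differs from query in 8 bits
    System.out.println(score(query, near) + " vs " + scaledScore(query, near));
    System.out.println(score(query, far) + " vs " + scaledScore(query, far));
  }
}
```

For a fixed vector length the two scores differ only by the factor
`dimensions * 8`, so the ranking of candidates is the same under either formula.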
##########
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##########
@@ -214,4 +214,19 @@ public static float[] checkFinite(float[] v) {
}
return v;
}
+
+ public static int binaryHammingDistance(byte[] a, byte[] b) {
+ int distance = 0, i = 0;
+ for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
+ distance +=
+ Long.bitCount(
+ ((long) BitUtil.VH_NATIVE_LONG.get(a, i) ^ (long) BitUtil.VH_NATIVE_LONG.get(b, i))
+ & 0xFFFFFFFFFFFFFFFFL);
Review Comment:
Remove the `& 0xFFFFFFFFFFFFFFFFL`; it is a no-op, because the mask has all
64 bits set. See my previous comment with the "final version":
https://github.com/apache/lucene/pull/13076#issuecomment-1928027541
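
As an illustration of why the mask can go, here is a self-contained sketch of
the loop without it. It is not the "final version" from the linked comment. It
assumes a `VarHandle` created with `MethodHandles.byteArrayViewVarHandle` as a
stand-in for `BitUtil.VH_NATIVE_LONG`, and the byte-wise tail loop is likewise
an assumption, since the diff above is cut off before the tail handling.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class HammingMaskSketch {

  // Stand-in for BitUtil.VH_NATIVE_LONG (assumption): reads a long from a
  // byte[] at a given byte offset, in native byte order.
  private static final VarHandle NATIVE_LONG =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

  public static int binaryHammingDistance(byte[] a, byte[] b) {
    int distance = 0, i = 0;
    // Process 8 bytes at a time; no mask is needed on the 64-bit XOR result.
    for (final int upperBound = a.length & ~(Long.BYTES - 1); i < upperBound; i += Long.BYTES) {
      distance += Long.bitCount((long) NATIVE_LONG.get(a, i) ^ (long) NATIVE_LONG.get(b, i));
    }
    // Tail: remaining bytes, masked to 8 bits because of int sign extension.
    for (; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  // ANDing a long with an all-ones mask returns the same long.
  public static boolean maskIsNoOp(long x) {
    return (x & 0xFFFFFFFFFFFFFFFFL) == x;
  }

  public static void main(String[] args) {
    byte[] a = new byte[9];
    byte[] b = new byte[9];
    java.util.Arrays.fill(a, (byte) 0xFF);
    System.out.println(binaryHammingDistance(a, b));
    System.out.println(maskIsNoOp(Long.MIN_VALUE));
  }
}
```

The mask only matters in the tail loop, where the operands are bytes widened to
`int`; on a full `long` there are no extra high bits to clear.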
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]