mikemccand commented on code in PR #16092:
URL: https://github.com/apache/lucene/pull/16092#discussion_r3328870432


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene104/Lucene104ScalarQuantizedVectorsReader.java:
##########
@@ -197,6 +213,14 @@ public RandomVectorScorer getRandomVectorScorer(String 
field, float[] target) th
     if (fi == null) {
       return null;
     }
+    // Rotate the query vector to match the rotated stored vectors.
+    float[] scoringTarget = target;
+    if (isRotationEnabled(field) && target != null) {
+      HadamardRotation rotation = rotationFor(field, fi.dimension);
+      float[] rotated = new float[target.length];
+      rotation.rotate(target, rotated);

Review Comment:
   > `rotate` is a `O(d log d)` operation here.
   
   Wait, this is not so costly?  It's like `log(d)` dot products for each 
segment, when we do many dot products per vector per (large-ish) segment.  It's 
fixed cost per segment, and amortizes to zero as the index gets bigger.
   
   > To see the significance of the performance impact, you will need many 
vectors spread over many segments (which is very common, I mean 10s of millions 
spread over 30-50+ segments).
   
   Actually, this should be the best case scenario -- the larger the 
index/segment, the more dot products you'll need to crawl its graph, reducing 
this small added fixed per-segment cost of `log(d)` dot products?
   
   I don't really understand the fuss.  The added cost (amortizes to zero) 
seems reasonable if we really do see such recall lift.
   
   @shubhamvishu you could create a tiny tool in luceneutil to just do a 
one-time rotation of `.vec` file?  Then we all could quickly evaluate impact on 
our own vector corpora, even without this PR, for starters.
   
   But anyway +1 to make a single global rotation matrix for each vector field, 
not because I'm worried about the performance impact of per-segment query 
vector rotation, but for a different reason: because I think this would then 
unlock data blind quantization as long as vectors are L2 unit-sphere 
normalized.  Then (once your vectors are truly isotropic), optimal quantization 
is solely a function of `int D` (dimension) and not the actual vectors, which 
would be amazing:
   
     * Faster indexing since we wouldn't need the extra pass through the 
vectors to get histogram of values across all dimensions / centroid for the 
segment (to tune the quantizer, which is not data blind today)
     * Merging could then bulk-copy vectors in quantized form, since their 
quantized encoding will not change
     * We could fully drop full precision vectors (!!) since they are no longer 
needed for tuning quantizer.  Of course user should still have option to keep 
them, e.g. maybe they want to re-rank.
   
   But let's keep this PR focused on pre-conditioning!  Data blind quantizer, 
options to keep full precision vectors or not, can come later...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to