mccullocht opened a new issue, #15789:
URL: https://github.com/apache/lucene/issues/15789

   ### Description
   
   I've been thinking about this for a while and have prototyped parts of it; I'm curious whether others think this is worth the effort and complexity.
   
   OSQ produces an approximate distance between a query vector and a quantized doc vector in the index. I propose extending this to also produce statistical lower and upper confidence bounds on the true distance.
   
   There are a couple of possible applications that I can think of:
   1. "Wavelet"-style distance scoring.
   2. Statistical bounding for re-scoring.
   
   There are probably other applications wherever there's an ordered list of 
scored vectors.
   
   ## Proposed Mechanism
   
   WARNING: I used Gemini to explore the math around this idea. I have verified 
this works in practice on some sets of L2 normalized vectors but I'm not sure 
how well this generalizes.
   
   The main observation is that the dot product of a fixed unit vector and a random unit vector is a random variable with mean 0 and standard deviation $\frac{1}{\sqrt{D}}$, where $D$ is the number of dimensions. Each per-dimension product of two unit vectors has magnitude on the order of $\frac{1}{\sqrt{D}} \cdot \frac{1}{\sqrt{D}}$, and summing $D$ such terms scales the standard deviation of the sum by $\sqrt{D}$, so the total standard deviation ends up scaling as:
   
   $$\sigma \approx \sqrt{D} \cdot \left(\frac{1}{\sqrt{D}} \cdot \frac{1}{\sqrt{D}}\right) = \frac{1}{\sqrt{D}}$$
   
   We can combine this with a $Z$ score to adjust the confidence interval on 
the bound.
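   The per-dimension scaling argument above can be checked empirically with a quick Monte Carlo experiment. The following is a standalone sketch, not proposed Lucene code; it samples random unit vectors and measures the standard deviation of their dot product with a fixed unit vector:

   ```java
   import java.util.Random;

   /**
    * Standalone sanity check: the dot product of a fixed unit vector with
    * random unit vectors has mean ~0 and standard deviation ~1/sqrt(D).
    */
   public class DotProductSigma {

     /** Sample standard deviation of dot(e_0, v) over random unit vectors v. */
     static double measuredStd(int d, int trials, long seed) {
       Random rng = new Random(seed);
       double sum = 0, sumSq = 0;
       for (int t = 0; t < trials; t++) {
         // dot(e_0, v) is just v[0] when the fixed vector is the basis vector e_0
         double dot = randomUnitVector(d, rng)[0];
         sum += dot;
         sumSq += dot * dot;
       }
       double mean = sum / trials;
       return Math.sqrt(sumSq / trials - mean * mean);
     }

     /** Draw an isotropic random unit vector by normalizing a Gaussian sample. */
     static float[] randomUnitVector(int d, Random rng) {
       float[] v = new float[d];
       double norm = 0;
       for (int i = 0; i < d; i++) {
         v[i] = (float) rng.nextGaussian();
         norm += (double) v[i] * v[i];
       }
       double inv = 1.0 / Math.sqrt(norm);
       for (int i = 0; i < d; i++) v[i] *= inv;
       return v;
     }

     public static void main(String[] args) {
       int d = 256;
       System.out.printf("measured=%.5f expected=%.5f%n",
           measuredStd(d, 20_000, 42), 1.0 / Math.sqrt(d));
     }
   }
   ```

   For $D = 256$ the measured value should land very close to $1/\sqrt{256} = 0.0625$.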
   
   Quantization error needs to be incorporated into the bound to reflect our level of confidence given the loss of precision. As a starting point we will use the L2 norm of the quantization residual:
   
   $$\lVert \epsilon \rVert = \lVert v - \bar{v} \rVert$$
   
   where $v$ is the original vector and $\bar{v}$ is the de-quantized vector. This $\lVert \epsilon \rVert$ value can be stored as a single float for each vector. By Cauchy-Schwarz, multiplying $\lVert \epsilon \rVert$ by the L2 norm of the other vector yields a hard bound on the error: for two vectors $q$ and $d$ the hard bound is $\lVert q \rVert \cdot \lVert \epsilon_d \rVert + \lVert d \rVert \cdot \lVert \epsilon_q \rVert$. Combined with the standard deviation and $Z$ score we get:
   
   $$Bound = Z \cdot \sigma = Z \cdot \frac{\lVert q \rVert \cdot \lVert 
\epsilon_d \rVert + \lVert d \rVert \cdot \lVert \epsilon_q \rVert}{\sqrt{D}}$$
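   For illustration, the bound is just a few arithmetic operations over the four stored scalars. The class and method names below are hypothetical, not existing Lucene APIs:

   ```java
   /** Sketch of the proposed confidence bound; names are illustrative. */
   public final class ConfidenceBound {

     /**
      * @param qNorm    L2 norm of the raw query vector
      * @param dNorm    L2 norm of the raw doc vector
      * @param qEpsilon L2 norm of the query quantization residual ||q - q_bar||
      * @param dEpsilon L2 norm of the doc quantization residual ||d - d_bar||
      * @param dims     vector dimensionality D
      * @param z        Z score, e.g. 1.96f for a 95% confidence interval
      */
     public static float dotProductBound(
         float qNorm, float dNorm, float qEpsilon, float dEpsilon, int dims, float z) {
       float sigma = (qNorm * dEpsilon + dNorm * qEpsilon) / (float) Math.sqrt(dims);
       return z * sigma;
     }

     /** For the dot-product-derived L2 formulation the bound is doubled. */
     public static float l2Bound(
         float qNorm, float dNorm, float qEpsilon, float dEpsilon, int dims, float z) {
       return 2f * dotProductBound(qNorm, dNorm, qEpsilon, dEpsilon, dims, z);
     }
   }
   ```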
   
   This formulation works for dot product distance computations; for L2 we need to multiply the value by 2 to account for our dot-product-derived formulation of squared L2 distance, $\lVert q \rVert^2 + \lVert d \rVert^2 - 2 \cdot \langle q,d \rangle$.
   
   In cases where we have a very high precision query (8+ bits), the $\lVert \epsilon \rVert$ value will be very small and we can likely omit it without any significant loss of precision.
   
   As noted, we will have to store $\lVert \epsilon \rVert$ as a scalar on each vector; for angular distance we will also need to store the L2 norm (e.g. $\lVert q \rVert$), which is computed but not stored today. This will make each vector 4-8 bytes larger in storage.
   
   ## Wavelet-style distance scoring
   
   Extend `RandomVectorScorer` to add a new method:
   
   ```java
   public interface RandomVectorScorer {
       /**
        * Returns the score between the query and the provided node. If the 
score would be less than
        * minSimilarity the implementation may exit early with a negative 
sentinel score.
        *
        * @param node the node to score
        * @param minSimilarity the minimum similarity to consider
        * @return the computed score, or a negative sentinel score if the score 
would be less than
        *     minSimilarity
        */
       float score(int node, float minSimilarity);
   }
   ```
   
   When collecting results we may use this method and pass the `minSimilarity` 
from the result set as the bound -- for instance, we might use this in HNSW 
traversal since any candidate with lower similarity than the worst result would 
never be popped.
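   A collection loop over this method might look like the following standalone sketch; the min-heap and negative sentinel convention are assumptions for illustration, not Lucene's actual collector code:

   ```java
   import java.util.PriorityQueue;

   public class BoundedCollector {
     /** Sentinel returned when the bound proves a node cannot beat minSimilarity. */
     static final float SENTINEL = -1f;

     /** Mirrors the proposed method; standalone here, not Lucene's interface. */
     interface RandomVectorScorer {
       float score(int node, float minSimilarity);
     }

     /**
      * Collect the top-k scores over a candidate list, passing the current
      * worst retained score as minSimilarity so the scorer may exit early.
      */
     static PriorityQueue<Float> collect(RandomVectorScorer scorer, int[] candidates, int k) {
       PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap: worst score on top
       for (int node : candidates) {
         float min = topK.size() < k ? Float.NEGATIVE_INFINITY : topK.peek();
         float score = scorer.score(node, min);
         if (score == SENTINEL) continue; // bound says it cannot make the top-k
         topK.offer(score);
         if (topK.size() > k) topK.poll(); // evict the worst
       }
       return topK;
     }
   }
   ```

   A real implementation would track doc ids alongside scores; the sketch only shows the early-exit plumbing.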
   
   An OSQ based scorer can implement this method by:
   1. Quantizing the query into a primary vector the way we do today, then quantizing a second _residual_ vector from the difference between the original query and the de-quantized primary vector.
   2. Computing the score of the two primary vectors along with an upper bound score. If the upper bound score is less than `minSimilarity`, exit early.
   3. Adjusting the primary dot product by adding the dot product of the residual query and the primary doc vector, computing a score, and returning it.
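   The three steps above could be sketched as follows, using plain float vectors to stand in for the quantized forms; all names are illustrative, not the actual OSQ scorer:

   ```java
   /** Self-contained sketch of primary-score / bound-check / residual-refine. */
   public class ResidualScorer {
     static final float SENTINEL = -1f;

     final float[] primaryQuery;   // de-quantized primary query
     final float[] residualQuery;  // residual: query - dequantize(primaryQuery)
     final float[][] docs;         // de-quantized doc vectors
     final float[] bounds;         // per-doc confidence bound (Z * sigma)

     ResidualScorer(float[] primaryQuery, float[] residualQuery, float[][] docs, float[] bounds) {
       this.primaryQuery = primaryQuery;
       this.residualQuery = residualQuery;
       this.docs = docs;
       this.bounds = bounds;
     }

     float score(int node, float minSimilarity) {
       // 1. dot product of the primary quantized query and the doc vector
       float primary = dot(primaryQuery, docs[node]);
       // 2. optimistic upper bound; if even that cannot compete, exit early
       if (primary + bounds[node] < minSimilarity) return SENTINEL;
       // 3. refine by adding the residual query's contribution
       return primary + dot(residualQuery, docs[node]);
     }

     static float dot(float[] a, float[] b) {
       float sum = 0;
       for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
       return sum;
     }
   }
   ```

   In the real OSQ path the dot products would run over packed quantized bytes; the control flow is the interesting part here.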
   
   I've tried this in exhaustive scoring cases on test data sets and the 
residual adjustment (3) triggers about 0.5% of the time over a set of 2M 
vectors with minimal loss in recall (<0.5%) with $Z = 1.96$ (95% confidence). 
Assuming 1-bit doc vectors I don't think this would have a large impact on HNSW 
traversal as we are dominated by memory latency and the fallback should trigger 
a lot more since the scored set correlates with the highest scores. Exhaustive 
retrieval cases will likely do better, as would any partition-based search 
strategy like IVF or SPANN.
   
   ## Statistical bounding for re-scoring
   
   If the user wants 100 results and I have bounds for each result, I can take the smallest lower bound among my 100 results and include any vector whose _upper bound_ is greater than that value. This would make re-scoring more precise, since any vector that could plausibly make the top 100 under a more precise score would be included.
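   The selection step might be sketched as below (illustrative names; it assumes the conservative inclusion threshold is the smallest lower bound among the provisional top-k, since a candidate only needs to beat the worst retained result):

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class BoundedRescore {
     /** An approximately scored doc with its confidence interval. */
     record Scored(int doc, float lower, float upper) {}

     /**
      * Given candidates sorted by approximate score (best first), return the
      * docs worth re-scoring: anything whose upper bound could still reach the
      * provisional top-k.
      */
     static List<Integer> rescoreCandidates(List<Scored> sortedByScore, int k) {
       // smallest lower bound among the provisional top-k
       float threshold = Float.POSITIVE_INFINITY;
       for (int i = 0; i < Math.min(k, sortedByScore.size()); i++) {
         threshold = Math.min(threshold, sortedByScore.get(i).lower());
       }
       List<Integer> out = new ArrayList<>();
       for (Scored s : sortedByScore) {
         if (s.upper() >= threshold) {
           out.add(s.doc());
         }
       }
       return out;
     }
   }
   ```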
   
   In Lucene this is likely to be annoying to implement, as I want heap items to be 16-byte structs but the heap implementation only accepts 8-byte items (encoded as `long`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

