tveasey commented on PR #15903:
URL: https://github.com/apache/lucene/pull/15903#issuecomment-4187588175

   > This could be the most interesting direction (and MAY explain our low 
recall values on internal data sets)
   
   This is probably the first thing to try out for your problem case. For some 
data sets the uplift can be dramatic. This is usually the only reason we see 
bad accuracy with OSQ, although low-dimensional vectors are also less 
compressible. Note though that many general embedding models produce components 
that are fairly close to normally distributed, for which you get small benefits 
from this technique.
   
   If the internal cases are model based and you own the training process, 
there are very standard methods (such as [spreading 
losses](https://arxiv.org/pdf/1708.06320) or simulating quantisation with 
straight-through estimators) which give dramatic improvements in 
compressibility of vectors and should probably be introduced to the training 
pipeline.
   
   Another challenging case is CLIP-style models, which suffer from a modality 
gap (between text queries and image documents for example) unless it is trained 
out. These pose additional challenges for quantisation, which would ideally be 
query-distribution aware. These days CLIP models perform less well than VLMs in 
relevance for multimodal retrieval, so if your internal use cases use a CLIP 
architecture, VLMs might be something to explore.
   
   > The spacing difference is small but the consequence is large
   
   Non-uniform grids will definitely retain more information about the original 
vectors. We can't really compete in this respect.
   
   However, the crux comes down to what you are trying to optimise for here. If 
it is purely recall for maximum compression then great. In this case ideally 
we'd consider the query distribution too, but this seems a reasonable first 
step. However, typically you care about recall vs latency. In this case I think 
we'd have to prove out that we can implement the distance operation 
competitively with integer arithmetic. 
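
   To make the integer arithmetic point concrete, here is a minimal sketch of why a uniform grid is convenient for the distance operation. It is not the Lucene or OSQ implementation; `quantise` and `approx_dot` are made-up names, and taking the per-vector min/max as the grid bounds is an assumption for illustration. Because every component is `lo + step * code`, the dot product factors into one integer multiply-accumulate over the codes plus a few scalar corrections. With a non-uniform grid each code would need a table lookup first, so this factorisation is not available.

```python
def quantise(v, bits=4):
    """Map floats onto a uniform grid so that v[i] ~= lo + step * codes[i]."""
    lo, hi = min(v), max(v)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / step) for x in v]  # integers in [0, levels]
    return codes, lo, step


def approx_dot(qa, qb):
    """Dot product of two quantised vectors; the per-component work is integer only."""
    (ca, lo_a, step_a), (cb, lo_b, step_b) = qa, qb
    d = len(ca)
    int_dot = sum(x * y for x, y in zip(ca, cb))  # integer multiply-accumulate
    sum_a, sum_b = sum(ca), sum(cb)               # could be stored per vector
    # Expand sum_i (lo_a + step_a*ca[i]) * (lo_b + step_b*cb[i]).
    return (d * lo_a * lo_b
            + lo_a * step_b * sum_b
            + lo_b * step_a * sum_a
            + step_a * step_b * int_dot)


a = [0.1, -0.4, 0.8, 0.3]
b = [0.5, 0.2, -0.7, 0.9]
exact = sum(x * y for x, y in zip(a, b))
approx = approx_dot(quantise(a, bits=8), quantise(b, bits=8))
print(exact, approx)
```

   The correction terms only need the code sums and the grid parameters, so the hot loop is just `int_dot`, which is the part that maps well onto SIMD integer instructions.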
   
   There are two other factors in the mix:
   1. Centring the data with respect to multiple centroids helps (a reasonable 
amount). We can account for the centroid contribution exactly in OSQ so we gain 
by having to quantise smaller magnitude vectors (PQ implementations often use a 
similar trick). We also (although not in the original blog) worked out an 
asymmetric centring approach. This means we can quantise the query fewer times 
in the course of a search. This is good for the recall/latency trade-off. We 
get the centres for free with IVF-style indices but not HNSW. One can 
potentially get cheap approximate clustering from the HNSW graph. We've also 
thought HNSW may benefit from some clustering in the mix at construction time.
   2. You get massive wins from reranking. For example, for recall@10 reranking 
30 vectors often gives you a 15-20 point improvement in recall. This will often 
be the most important lever at your disposal. So, if you want to hit high 
recalls, working out how to maximise rerank performance is probably the most 
valuable thing. For example, jumping straight to loading float32 vectors for 
reranking is significantly suboptimal. Even just bfloat16 reranking is probably 
essentially as good, but reads half the bytes per vector, so you get roughly 2x 
the effective memory throughput.
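
   The two factors above can be sketched together as follows. This is a toy illustration under stated assumptions, not what Lucene does: it uses a single centroid rather than several, symmetric rather than asymmetric centring, plain Python floats for the rerank rather than bfloat16, and all names (`quantise_residual`, `search`) are hypothetical. The first pass scores documents as an exact centroid term plus a quantised residual term, then only the top `k * oversample` candidates are rescored at full precision.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantise_residual(v, centroid, bits=2):
    """Quantise v - centroid on a uniform grid and return the dequantised residual."""
    r = [x - c for x, c in zip(v, centroid)]
    lo, hi = min(r), max(r)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    return [lo + step * round((x - lo) / step) for x in r]


def search(query, docs, centroid, k=10, oversample=3, bits=2):
    """Return (first-pass top-k ids, reranked top-k ids)."""
    # dot(q, x) = dot(q, c) + dot(q, x - c): the centroid term is exact and
    # only the smaller-magnitude residual carries quantisation error.
    centroid_term = dot(query, centroid)
    # A real index would quantise residuals once at build time, not per query.
    approx = [centroid_term + dot(query, quantise_residual(d, centroid, bits))
              for d in docs]
    order = sorted(range(len(docs)), key=lambda i: -approx[i])
    candidates = order[:k * oversample]
    # Rerank only the candidates using full-precision scores.
    reranked = sorted(candidates, key=lambda i: -dot(query, docs[i]))
    return order[:k], reranked[:k]


rng = random.Random(0)
dim, n, k = 32, 500, 10
centre = [rng.gauss(0, 1) for _ in range(dim)]
docs = [[c + rng.gauss(0, 0.3) for c in centre] for _ in range(n)]
query = [rng.gauss(0, 1) for _ in range(dim)]
centroid = [sum(col) / n for col in zip(*docs)]

exact = set(sorted(range(n), key=lambda i: -dot(query, docs[i]))[:k])
first_pass, reranked = search(query, docs, centroid, k=k)
print("recall@10 before rerank:", len(exact & set(first_pass)) / k)
print("recall@10 after rerank: ", len(exact & set(reranked)) / k)
```

   Ignoring ties, reranking can never lower recall here, because the candidate set contains the first-pass top `k` and the exact scores put every true neighbour in the candidate set into the final top `k`. How much it gains depends on the data and the bit width.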


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

