tveasey commented on PR #15903: URL: https://github.com/apache/lucene/pull/15903#issuecomment-4187588175
> This could be the most interesting direction (and MAY explain our low recall values on internal data sets)

This is probably the first thing to try out for your problem case. For some data sets the uplift can be dramatic. This is usually the only reason we see bad accuracy with OSQ, although low-dimensional vectors are also less compressible. Note, though, that many general embedding models produce fairly normal components, for which you get only small benefits from this technique.

If the internal cases are model based and you own the training process, there are very standard methods (such as [spreading losses](https://arxiv.org/pdf/1708.06320) or simulating quantisation with straight-through estimators) which give dramatic improvements in the compressibility of vectors and should probably be introduced to the training pipeline.

Another challenging case is CLIP-style models, which suffer from a modality gap (between text queries and image documents, for example) unless it is trained out. These pose additional challenges for quantisation, which would ideally be query-distribution aware. These days CLIP models perform less well than VLMs in relevance for multimodal retrieval, so if your internal use cases use a CLIP architecture, VLMs might be something to explore.

> The spacing difference is small but the consequence is large

Definitely, non-uniform grids will retain more information about the original vectors. We can't really compete in this respect. However, the crux is what you are trying to optimise for here. If it is purely recall at maximum compression, then great. In this case ideally we'd consider the query distribution too, but this seems a reasonable first step. However, typically you care about recall vs latency. In that case I think we'd have to prove out that we can implement the distance operation competitively with integer arithmetic.

There are two other factors in the mix:

1. Centring the data with respect to multiple centroids helps (a reasonable amount).
We can account for the centroid contribution exactly in OSQ, so we gain by having to quantise smaller-magnitude vectors (PQ implementations often use a similar trick). We also (although not in the original blog) worked out an asymmetric centring approach. This means we can quantise the query fewer times in the course of a search, which is good for the recall-latency trade-off. We get the centres for free with IVF-style indices but not with HNSW. One can potentially get cheap approximate clustering from the HNSW graph. We've also thought HNSW may benefit from some clustering in the mix at construction time.

2. You get massive wins from reranking. For example, for recall@10, reranking 30 vectors often gives you a 15-20 point improvement in recall. This will often be the most important lever at your disposal. So, if you want to hit high recalls, working out how to maximise rerank performance is probably the most valuable thing. For example, jumping straight to loading float32 vectors for reranking is significantly suboptimal. Even just bfloat16 reranking is probably essentially as good, but has 2x memory throughput.
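
To make the centring point concrete, here is a rough NumPy sketch, not the Lucene or OSQ code. It relies on the identity `dot(q, x) = dot(q, c) + dot(q, x - c)`: the centroid term is computed exactly once per centroid, and only the smaller-magnitude residual `x - c` is quantised. The quantiser `quantise_uniform` is a hypothetical per-vector uniform 4-bit grid used as a stand-in for OSQ, the data are synthetic Gaussian clusters, and all names are assumptions made for illustration.

```python
import numpy as np

def quantise_uniform(v, bits=4):
    """Stand-in quantiser (NOT OSQ): snap v to 2**bits uniform levels in [min, max]."""
    lo, hi = v.min(), v.max()
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((v - lo) / step)
    return lo + codes * step  # dequantised approximation of v

rng = np.random.default_rng(0)
dim, n_clusters, per_cluster = 64, 8, 200

# Synthetic clustered data: well-separated cluster centres plus unit noise.
centres = rng.normal(scale=5.0, size=(n_clusters, dim))
labels = np.repeat(np.arange(n_clusters), per_cluster)
docs = centres[labels] + rng.normal(size=(labels.size, dim))
query = rng.normal(size=dim)

# Centroids taken as the mean of each (known) cluster; an IVF index has these already.
centroids = np.stack([docs[labels == k].mean(axis=0) for k in range(n_clusters)])
residuals = docs - centroids[labels]

exact = docs @ query
centroid_dots = centroids @ query  # one exact dot product per centroid

# The decomposition itself is exact before any quantisation.
split = centroid_dots[labels] + residuals @ query

# Quantise the raw vectors vs. only the residuals, and compare dot-product error.
raw_q = np.stack([quantise_uniform(d) for d in docs])
res_q = np.stack([quantise_uniform(r) for r in residuals])
err_raw = np.abs(raw_q @ query - exact).mean()
err_centred = np.abs(centroid_dots[labels] + res_q @ query - exact).mean()

print(f"mean |error| quantising raw vectors: {err_raw:.4f}")
print(f"mean |error| quantising residuals:   {err_centred:.4f}")
```

On data like this the residuals span a much narrower range than the raw vectors, so the same number of bits gives a finer grid. How much this helps on real embeddings depends on how clustered they are.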
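
The reranking point can be sketched in the same way. This is a minimal illustration under stated assumptions, not a benchmark: the coarse pass uses a hypothetical 1-bit sign quantiser as a stand-in for OSQ, the data are synthetic Gaussians, and bfloat16 storage is emulated by zeroing the low 16 bits of each float32 (truncation rather than round-to-nearest). The helper names (`to_bfloat16`, `search`, `mean_recall`) are made up for this example, and the numbers it prints say nothing about the 15-20 point figure quoted above.

```python
import numpy as np

def to_bfloat16(x):
    """Emulate bfloat16 storage: keep only the top 16 bits of each float32."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return np.argsort(-scores)[:k]

def search(query, coarse, fine, k=10, n_rerank=30):
    """Pick n_rerank candidates with the coarse vectors, then rescore them with fine."""
    candidates = top_k(coarse @ query, n_rerank)
    rescored = fine[candidates] @ query
    return candidates[np.argsort(-rescored)[:k]]

def recall_at_k(found, truth):
    return len(set(found.tolist()) & set(truth.tolist())) / len(truth)

rng = np.random.default_rng(0)
n_docs, dim, n_queries, k = 5000, 64, 50, 10
docs = rng.normal(size=(n_docs, dim)).astype(np.float32)
queries = rng.normal(size=(n_queries, dim)).astype(np.float32)

coarse = np.sign(docs).astype(np.float32)  # stand-in 1-bit quantiser (NOT OSQ)
docs_bf16 = to_bfloat16(docs)              # half the bytes per vector if stored as 16 bits

def mean_recall(fine, n_rerank):
    recalls = []
    for q in queries:
        truth = top_k(docs @ q, k)  # exact float32 ground truth
        recalls.append(recall_at_k(search(q, coarse, fine, k, n_rerank), truth))
    return float(np.mean(recalls))

# n_rerank == k with fine == coarse is equivalent to no reranking at all.
recall_coarse = mean_recall(coarse, n_rerank=k)
recall_f32 = mean_recall(docs, n_rerank=30)
recall_bf16 = mean_recall(docs_bf16, n_rerank=30)

print(f"recall@{k}, no rerank:          {recall_coarse:.3f}")
print(f"recall@{k}, rerank 30 float32:  {recall_f32:.3f}")
print(f"recall@{k}, rerank 30 bfloat16: {recall_bf16:.3f}")
```

Exact rescoring of a candidate set that contains the coarse top 10 cannot lose any true neighbour the coarse top 10 already had, so the reranked recall is never worse. How close the bfloat16 pass gets to float32 on real data would need measuring.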
