Thanks, Tomoko, I think you're right. Documentation and a helper function should do it. I see now that the ann-benchmarks toolkit can be made to do the normalization outside the indexing process.
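
To make the idea concrete, here is a minimal sketch of what such a helper might look like. The class name UnitVectors and its method signatures are hypothetical, chosen only for illustration; this is not an existing Lucene API. It returns a normalized copy rather than modifying the input, and the unit-length check is a separate method that callers can skip.

```java
/** Hypothetical helper for illustration only; not part of Lucene's API. */
final class UnitVectors {
  private UnitVectors() {}

  /** Returns a new array holding v scaled to unit L2 length; v itself is left unchanged. */
  static float[] normalize(float[] v) {
    double sumSquares = 0.0;
    for (float x : v) {
      sumSquares += (double) x * x;
    }
    if (sumSquares == 0.0) {
      // A zero vector has no direction, so it cannot be scaled to unit length.
      throw new IllegalArgumentException("cannot normalize a zero vector");
    }
    double length = Math.sqrt(sumSquares);
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / length);
    }
    return out;
  }

  /** Optional safety check: true if the squared length of v is within epsilon of 1. */
  static boolean isUnitLength(float[] v, double epsilon) {
    double sumSquares = 0.0;
    for (float x : v) {
      sumSquares += (double) x * x;
    }
    return Math.abs(sumSquares - 1.0) <= epsilon;
  }

  /** Plain dot product; for unit-length inputs this equals the cosine of the angle between them. */
  static float dotProduct(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

A caller would normalize once before indexing, e.g. `float[] unit = UnitVectors.normalize(raw);`, and index `unit`. Keeping the check separate means users who have already normalized their data do not pay for it.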

On Wed, Nov 25, 2020, 6:52 PM Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:

> Hi Mike,
> I'm looking forward to vector search being available in 9.0; thanks for
> your hard work on it.
>
> > Alternatively we could simply expect users to
> > perform such normalization, and throw an error if vectors intended for
> > comparison using dot product (which is specified when adding a value)
> > are not unit-length.
>
> For simplicity, we could assume normalized vectors as inputs and just
> document it - without any checks? Meanwhile, some utility functions (e.g.,
> o.a.l.util.VectorUtil) for it could be helpful.
>
> Tomoko
>
>
> On Thu, Nov 26, 2020, 6:23 Michael Sokolov <msoko...@gmail.com> wrote:
>
>> I have been working on getting benchmarks working on the GloVe public
>> data set and spent a while chasing down a bug with VectorValues.search
>> that turned out to be a bug with the data (sort of)! When comparing
>> vectors using an angular (dot product) measure, one has to normalize
>> by the vectors' lengths. Given that the only purpose of such vectors
>> is to compare them using dot-product, it would be sensible to
>> normalize them *in advance* to unit length, rather than doing so for
>> every comparison, yet this is not how this dataset at least is
>> distributed on the internet, and widely-referenced benchmarking
>> software such as ann-benchmarks assumes that code will handle such
>> details internally.
>>
>> I'm trying to see how we should handle this use case. We could provide
>> a convenience function for normalizing while indexing. But should we?
>> Would it happen when creating an IndexableField? When flushing? It's a
>> little strange if you index a vector, and then retrieve it and its
>> value is different! Alternatively we could simply expect users to
>> perform such normalization, and throw an error if vectors intended for
>> comparison using dot product (which is specified when adding a value)
>> are not unit-length. But then again this is a somewhat costly
>> operation that is only a safety measure, and users who already
>> normalized their vectors would pay the cost needlessly.
>>
>> For now, I'm doing nothing, but I wonder if we could offer users some
>> help here.
>>
>> -Mike