Thanks, Tomoko, I think you're right. Documentation and a helper function
should do it. I see now that the ann-benchmarks toolkit can be made to do the
normalization work outside the indexing process.
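
For illustration, the kind of helper discussed in this thread could be
sketched roughly as follows. This is only a sketch under stated assumptions:
the class and method names (VectorNormalization, normalize, isUnitLength,
dotProduct) are hypothetical and are not an existing Lucene API. It scales a
vector to unit length once, up front, so that a plain dot product between two
normalized vectors equals their cosine similarity, and it includes the
optional unit-length check whose cost is debated in the quoted messages.

```java
/** Hypothetical helper for illustration only; not part of Lucene's API. */
final class VectorNormalization {

  private VectorNormalization() {}

  /** Returns a new array holding v scaled to unit (L2) length; v itself is not modified. */
  static float[] normalize(float[] v) {
    double sumOfSquares = 0.0;
    for (float x : v) {
      sumOfSquares += (double) x * x;
    }
    if (sumOfSquares == 0.0) {
      throw new IllegalArgumentException("cannot normalize an all-zero vector");
    }
    double length = Math.sqrt(sumOfSquares);
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = (float) (v[i] / length);
    }
    return out;
  }

  /** Optional safety check: true if the squared length of v is within tolerance of 1. */
  static boolean isUnitLength(float[] v, double tolerance) {
    double sumOfSquares = 0.0;
    for (float x : v) {
      sumOfSquares += (double) x * x;
    }
    return Math.abs(sumOfSquares - 1.0) <= tolerance;
  }

  /** Plain dot product; for unit-length inputs this is the cosine of the angle between them. */
  static float dotProduct(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

A caller would normalize each vector once before indexing it (or before
handing it to the benchmark harness) and then compare with dotProduct alone.
The isUnitLength check needs a full pass over the vector, which is the extra
cost that users who have already normalized their data would pay needlessly.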

On Wed, Nov 25, 2020, 6:52 PM Tomoko Uchida <tomoko.uchida.1...@gmail.com>
wrote:

> Hi Mike,
> I'm looking forward to vector search being available in 9.0; thanks for
> your hard work on it.
>
> > Alternatively we could simply expect users to
> > perform such normalization, and throw an error if vectors intended for
> > comparison using dot product (which is specified when adding a value)
> > are not unit-length.
>
> For simplicity, could we assume normalized vectors as inputs and just
> document that, without any checks? Meanwhile, some utility functions for
> normalization (e.g., in o.a.l.util.VectorUtil) could be helpful.
>
> Tomoko
>
>
> On Thu, Nov 26, 2020, 6:23 Michael Sokolov <msoko...@gmail.com> wrote:
>
>> I have been working on getting benchmarks working on the GloVe public
>> data set and spent a while chasing down a bug with VectorValues.search
>> that turned out to be a bug with the data (sort of)! When comparing
>> vectors using an angular (dot product) measure, one has to normalize
>> by the vectors' lengths. Given that the only purpose of such vectors
>> is to compare them using dot-product, it would be sensible to
>> normalize them *in advance* to unit length, rather than doing so for
>> every comparison, yet this is not how this dataset at least is
>> distributed on the internet, and widely-referenced benchmarking
>> software such as ann-benchmarks assumes that code will handle such
>> details internally.
>>
>> I'm trying to see how we should handle this use case. We could provide
>> a convenience function for normalizing while indexing. But should we?
>> Would it happen when creating an IndexableField? When flushing? It's a
>> little strange if you index a vector, and then retrieve it and its
>> value is different! Alternatively we could simply expect users to
>> perform such normalization, and throw an error if vectors intended for
>> comparison using dot product (which is specified when adding a value)
>> are not unit-length. But then again this is a somewhat costly
>> operation that is only a safety measure, and users who already
>> normalized their vectors would pay the cost needlessly.
>>
>> For now, I'm doing nothing, but I wonder if we could offer users some
>> help here.
>>
>> -Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
