[ https://issues.apache.org/jira/browse/LUCENE-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433565#comment-17433565 ]

Julie Tibshirani commented on LUCENE-10191:
-------------------------------------------

This is helpful feedback. I'm also sensitive to the fact that the more 
complexity we add to a format, the harder it becomes to maintain backwards 
compatibility (BWC) and to write future implementations.

Some background: I think supporting Euclidean distance is really important. 
With certain datasets, similarity is measured in terms of Euclidean distance 
(instead of cosine), and in these cases it's critical to use Euclidean to get 
sensible results. Cosine similarity is less critical, since we could ask users 
to normalize all vectors to unit length before indexing and searching, and then 
use dot product. Personally I think cosine is valuable (more details in 
https://issues.apache.org/jira/browse/LUCENE-10146), but I'm very happy to 
discuss trade-offs. In general, supporting different vector functions adds 
little complexity compared to the ANN data structure itself.
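For illustration, here's a rough standalone sketch of the normalization idea 
(plain Java, not the actual Lucene code): once vectors are scaled to unit 
length up front, a plain dot product gives the same score as cosine similarity.
{code:java}
// Illustrative sketch only, not Lucene's implementation: after normalizing
// vectors to unit length, dot product and cosine similarity coincide.
static float[] normalize(float[] v) {
  double norm = 0;
  for (float x : v) {
    norm += x * x;
  }
  norm = Math.sqrt(norm); // assumes v is not the zero vector
  float[] unit = new float[v.length];
  for (int i = 0; i < v.length; i++) {
    unit[i] = (float) (v[i] / norm);
  }
  return unit;
}

static float dotProduct(float[] a, float[] b) {
  float sum = 0;
  for (int i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// cosine(a, b) == dotProduct(normalize(a), normalize(b))
{code}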
{quote}Instead, slower functions needing different representation should really 
be different codecs... And trying to support these functions the way it happens 
now is wrong to do and will lead to hairballs.
{quote}
To check that I understand the idea – are you suggesting a separate format per 
ANN method and per similarity function?

> Optimize vector functions by precomputing magnitudes
> ----------------------------------------------------
>
>                 Key: LUCENE-10191
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10191
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Minor
>
> Both Euclidean distance (L2 norm) and cosine similarity can be expressed in 
> terms of dot product and vector magnitudes:
>  * l2_norm(a, b) = ||a - b|| = sqrt(||a||^2 - 2(a . b) + ||b||^2)
>  * cosine(a, b) = a . b / (||a|| ||b||)
> We could compute and store each vector's magnitude upfront while indexing, 
> and compute the query vector's magnitude once per query. Then we'd calculate 
> the distance using our (very optimized) dot product method, plus the 
> precomputed values.
> This is an exploratory issue: I haven't tested this out yet, so I'm not sure 
> how much it would help. I would at least expect it to help with cosine 
> similarity – several months ago we tried out similar ideas in Elasticsearch 
> and were able to get a nice boost in cosine performance.
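
To make the description above concrete, here is a minimal sketch (hypothetical 
method names, not Lucene's actual API) of deriving both similarities from one 
dot product plus precomputed squared magnitudes:
{code:java}
// Hypothetical sketch of the precomputation idea; names are illustrative,
// not Lucene's API.
static float dot(float[] a, float[] b) {
  float sum = 0;
  for (int i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Computed once per indexed vector (and stored), and once per query vector.
static float squaredMagnitude(float[] v) {
  return dot(v, v);
}

// l2_norm(a, b) = sqrt(||a||^2 - 2(a . b) + ||b||^2)
static double l2Norm(float[] a, float aSqMag, float[] b, float bSqMag) {
  return Math.sqrt(aSqMag - 2 * dot(a, b) + bSqMag);
}

// cosine(a, b) = a . b / (||a|| ||b||)
static double cosine(float[] a, float aSqMag, float[] b, float bSqMag) {
  return dot(a, b) / Math.sqrt((double) aSqMag * bSqMag);
}
{code}
With the magnitudes precomputed, each distance evaluation at search time costs 
one dot product plus a few scalar operations.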


