Hello, I have been trying to find an efficient (in terms of performance) way to get the Cosine Similarity between two Lucene Documents.
I have seen that this can be done in two ways:

1. Converting the document into a query, submitting the query, and getting the results and their scores. This is TOO SLOW if you want it for all documents in a corpus.
2. Using the MoreLikeThis class, but this is not what I really want.

What I want is the following: I have 3 different fields (zones) in my index (corpus) for each document. Each zone has its own boost (weight). What I need is the distance between all pairs of documents in my index, computed using the different term weights (from each field's boost). In other words, I need to calculate the Similarity formula for all pairs of documents in the index.

Does anyone know of a project or code that does this? It would take some time to develop this myself.

Thanks a lot in advance,

--
Asterios Katsifodimos
High Performance Computing systems Lab
Department of Computer Science, University of Cyprus
http://grid.ucy.ac.cy
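
To make the request concrete, the computation described in this message could look roughly like the plain-Java sketch below. It is only an illustration under several assumptions: it uses no Lucene API at all, the per-field term frequencies are assumed to have been extracted from the index already, the field names and boost values are hypothetical, and raw term frequency scaled by the field boost stands in for whatever term weighting the real Similarity would apply (idf, for instance, is left out). The class name `WeightedCosine` and its methods are made up for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WeightedCosine {

    // Merge per-field term frequencies into one sparse vector,
    // scaling each field's counts by that field's boost (default 1.0).
    public static Map<String, Double> weightedVector(
            Map<String, Map<String, Integer>> fieldTermFreqs,
            Map<String, Double> boosts) {
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> field : fieldTermFreqs.entrySet()) {
            double boost = boosts.getOrDefault(field.getKey(), 1.0);
            for (Map.Entry<String, Integer> tf : field.getValue().entrySet()) {
                // Prefix with the field name so the same term in two zones stays distinct.
                String key = field.getKey() + ":" + tf.getKey();
                vec.merge(key, boost * tf.getValue(), Double::sum);
            }
        }
        return vec;
    }

    // Euclidean length of a sparse vector.
    public static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity of two sparse vectors; 0.0 if either one is empty.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Symmetric similarity matrix over all pairs; each pair is computed once.
    public static double[][] allPairs(List<Map<String, Double>> docs) {
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            sim[i][i] = norm(docs.get(i)) == 0.0 ? 0.0 : 1.0;
            for (int j = i + 1; j < n; j++) {
                double s = cosine(docs.get(i), docs.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }
}
```

The all-pairs loop is still quadratic in the number of documents, so this only pins down what is being computed; it does not address the performance concern for a large corpus.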