Hello everyone,

My task is to compute the pairwise cosine similarity between the documents in a list.
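To make the target concrete, here is a minimal sketch of the value I am after, written in plain Java without Lucene. The class CosineSketch and its methods are made up for illustration. It assumes raw term counts as weights (no idf) and uses lowercasing plus whitespace splitting in place of the real analyzer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: cosine similarity over raw term counts.
class CosineSketch {

    // Build a term -> count map. Lowercasing and whitespace splitting
    // are stand-ins for a real analyzer (assumption for this sketch).
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of the term vector. Every term of the document
    // contributes here, whether or not the other document contains it.
    static double norm(Map<String, Integer> tf) {
        double sumSquares = 0.0;
        for (int count : tf.values()) {
            sumSquares += (double) count * count;
        }
        return Math.sqrt(sumSquares);
    }

    // Dot product over shared terms, divided by the product of both lengths.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFreqs("apache lucene");
        Map<String, Integer> d2 = termFreqs("apache lucene search search");
        System.out.println(cosine(d1, d1)); // identical documents
        System.out.println(cosine(d1, d2)); // d2 has extra terms
    }
}
```

In this sketch the second value comes out below the first: the extra terms in d2 add nothing to the dot product, but they do lengthen d2's vector, so they lower the result.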
I first index all the documents with the DOCS_AND_FREQS option, then I construct a query from every term of a document:

    Query query = parser.parse(document);

making sure to use the same analyzer at indexing and search time. I have also implemented my own similarity class so that I exclude coord(), sloppyFreq() etc. My implementation is here: http://pastebin.com/MArCs3ff

I still don't get the correct results, however. The scores do make sense from a search perspective, but they are not the values I am looking for. I am a bit lost as to what I should change to fine-tune the behaviour exactly as I want it.

The Lucene scoring formula, for example, confuses me with this part:

    Σ tf(t in d)
    <http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html#formula_tf>

This means that it only takes into account terms that exist in the query (in my case a document). Terms that exist in the other document but not in the query do not alter the results, correct?

I hope what I am asking is clear enough. If you need more information from me, please ask.

Thank you in advance,
Fotios