Computing the similarity of documents

Fotis P Thu, 21 May 2015 07:51:13 -0700

Hello everyone,

My task is to compute the pairwise cosine similarity between all the
documents in a list.
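
To make the target value concrete, here is a minimal sketch of the cosine
similarity I mean, computed outside Lucene on raw term frequencies. The class
name CosineSketch, the whitespace tokenisation and the absence of any idf
weighting are assumptions for illustration only; my real setup uses a Lucene
analyzer.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {

    // Build a raw term-frequency vector from whitespace-separated tokens.
    // (Assumption: a real setup would use the same analyzer as the index.)
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector; covers ALL of its terms.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    // cos(a, b) = (a . b) / (|a| * |b|); only shared terms add to the dot product.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFreqs("lucene scores documents");
        Map<String, Integer> d2 = termFreqs("lucene indexes documents quickly");
        System.out.println(cosine(d1, d2));
    }
}
```

Note that the value is symmetric in the two documents, and that both norms run
over every term of their document, not only the shared ones.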


I first index all the documents with the DOCS_AND_FREQS index option, then I
construct a query from all the terms of a document:

Query query = parser.parse(document);

making sure to use the same analyzer at indexing and search time.

I have also implemented my own similarity class so that factors such as
coord() and sloppyFreq() are excluded. My implementation is here:
http://pastebin.com/MArCs3ff

I still don't get the correct results, however. The scores do make sense
from a search perspective, but they are not the cosine similarity values
that I am looking for.

I am a bit lost as to what I should change to get exactly the behaviour I
want. The Lucene scoring formula, for example, confuses me with this
part: Σ tf(t in d)
<http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html#formula_tf>
This means that it only takes into account terms that exist in the query
(in my case, a document). Terms that exist in the other document but not in
the query do not alter the results, correct?
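
To show how I am reading that sum, here is a toy sketch. It is not Lucene
code: the class QueryTermSumSketch and its method are made up for
illustration, tf is taken as the raw count, and idf, boosts and norms are
left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

public class QueryTermSumSketch {

    // Toy stand-in for the summation: add up tf(t in d) for query terms only.
    // Terms that occur in the document but not in the query are never visited.
    static int sumOverQueryTerms(Map<String, Integer> queryTf, Map<String, Integer> docTf) {
        int sum = 0;
        for (String term : queryTf.keySet()) {
            sum += docTf.getOrDefault(term, 0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> query = new HashMap<>();
        query.put("lucene", 1);

        Map<String, Integer> doc = new HashMap<>();
        doc.put("lucene", 2);
        doc.put("index", 1);
        int before = sumOverQueryTerms(query, doc);

        // Add a term that the query does not contain and recompute.
        doc.put("scoring", 5);
        int after = sumOverQueryTerms(query, doc);

        System.out.println(before + " " + after);
    }
}
```

In this toy version the extra document term leaves the sum unchanged, which is
my understanding of the formula. Any length normalisation, which I have not
modelled here, could still depend on those extra terms.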

I hope my question is clear enough. If you need any more information from
me, please ask.

Thank you in advance,

Fotios
