Re: A simple Vector Space Model and TFIDF usage

Kamal Najib Thu, 02 Jul 2009 01:50:27 -0700

Hello Amir,
As far as I understand, you have two sets of documents, let us say set1 and set2.
If you want to get the similarity between the documents of the two sets, you have to
index the docs of one set and search with each doc of the other set as a query; the
score of each hit then gives you the similarity between the two documents. So:
1. Index the docs of set1.
2. For each doc-element from set2 do:
   Create a query that contains the content text of the doc-element.
   Search with it in your indexed docs from set1.
   From the hits you get, you can read the score, which is the similarity
   between the doc-element and every hit.
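
The two steps above can be sketched in plain Java without Lucene. This is only a toy illustration under stated assumptions: the class CrossSetSimilarity, its methods, the naive tokenizer and the unnormalized tf * idf score are all made up for this example and are not Lucene's API or its scoring formula. In a real setup Lucene's index and searcher take the place of the map and the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory stand-in for "index set1, then query with each doc of set2". */
public class CrossSetSimilarity {

    /** Lowercases and splits on non-alphanumerics; a placeholder for a real analyzer. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Raw term frequencies of one document. */
    public static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokenize(text)) tf.merge(t, 1, Integer::sum);
        return tf;
    }

    /** Step 1: "index" set1 as term -> (docId -> term frequency). */
    public static Map<String, Map<Integer, Integer>> buildIndex(List<String> set1) {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        for (int id = 0; id < set1.size(); id++) {
            for (Map.Entry<String, Integer> e : termFreqs(set1.get(id)).entrySet()) {
                index.computeIfAbsent(e.getKey(), k -> new HashMap<>()).put(id, e.getValue());
            }
        }
        return index;
    }

    /** Step 2: use one set2 document as the query; returns one score per set1 doc. */
    public static double[] search(Map<String, Map<Integer, Integer>> index,
                                  int numDocs, String queryText) {
        double[] scores = new double[numDocs];
        for (Map.Entry<String, Integer> q : termFreqs(queryText).entrySet()) {
            Map<Integer, Integer> postings = index.get(q.getKey());
            if (postings == null) continue; // query term does not occur in set1
            // Assumed textbook idf; Lucene's own formula differs.
            double idf = Math.log((double) numDocs / postings.size()) + 1.0;
            for (Map.Entry<Integer, Integer> p : postings.entrySet()) {
                // Unnormalized dot product of the query and document tf*idf weights.
                scores[p.getKey()] += q.getValue() * idf * p.getValue() * idf;
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> set1 = Arrays.asList("apache lucene search engine", "cooking pasta recipes");
        List<String> set2 = Arrays.asList("lucene search tutorial");
        Map<String, Map<Integer, Integer>> index = buildIndex(set1);
        for (String doc : set2) {
            System.out.println(doc + " -> " + Arrays.toString(search(index, set1.size(), doc)));
        }
    }
}
```

Each row of scores is one row of the set2-by-set1 similarity matrix. Only documents that share at least one term with the query are touched, which is why the index-and-search route scales better than comparing every pair directly.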


The directory where your indexed docs are saved represents the vector space
model you want to build. If you want to see how Lucene computes the score,
you can use the Explanation and Similarity classes in the Lucene API, and you
will see that Lucene treats documents and queries in the same way as a
vector space model. In the Explanation output you can see that Lucene uses the
TF, IDF and DF to compute the resulting score.
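
To make the vector space view concrete, here is a minimal plain-Java sketch of TF-IDF vectors and cosine similarity, producing the doc-doc matrix asked about in the original message. It assumes the textbook weighting tf * (ln(N/df) + 1) and a naive tokenizer. The class TfIdfCosine and its methods are hypothetical names for this example, and the numbers will not match Lucene's scores, which add their own normalization factors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Textbook TF-IDF vectors and cosine similarity; not Lucene's scoring. */
public class TfIdfCosine {

    /** Builds one sparse TF-IDF vector (term -> weight) per document. */
    public static List<Map<String, Double>> tfIdfVectors(List<String> docs) {
        List<Map<String, Integer>> tfs = new ArrayList<>();
        Map<String, Integer> df = new HashMap<>(); // document frequency per term
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String t : doc.toLowerCase().split("[^a-z0-9]+")) {
                if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
            }
            for (String term : tf.keySet()) df.merge(term, 1, Integer::sum);
            tfs.add(tf);
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : tfs) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / df.get(e.getKey())) + 1.0;
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    /** Euclidean length of a sparse vector. */
    public static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between two sparse vectors; 0 if either is empty. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    /** Full doc-doc matrix; quadratic in the number of docs, so only for small sets. */
    public static double[][] similarityMatrix(List<String> docs) {
        List<Map<String, Double>> vecs = tfIdfVectors(docs);
        double[][] m = new double[docs.size()][docs.size()];
        for (int i = 0; i < vecs.size(); i++) {
            for (int j = i; j < vecs.size(); j++) {
                m[i][j] = cosine(vecs.get(i), vecs.get(j));
                m[j][i] = m[i][j]; // cosine similarity is symmetric
            }
        }
        return m;
    }
}
```

For a large collection the all-pairs loop in similarityMatrix becomes the bottleneck, so the index-and-search approach described earlier in this message is the more scalable route.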
Best regards.
Kamal.
Original Message:

Hi,
It's my first experiment with Lucene. Please help me.
I'm going to index a set of documents and create a feature vector for
each of them. This vector contains all terms belonging to the document, weighted
using TFIDF.
After that I want to compute the cosine similarity between all documents
and produce a doc-doc similarity matrix. My document set is large and it's
important to have a scalable implementation.
Would you please provide me a guideline or to-do list?
Thank you and kind regards.

