On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:


Thanks for the reply. Sorry if my explanation is not clear. Yes, you are correct the model is based on Salton's VSM. However, the calculation of the term weight and the doc norm is, in my opinion, different from Lucene. If
you look at the table given in
http://www.miislita.com/term-vector/term-vector-3.html, they calcuate the document norm based on the weight wi=tfi*idfi. I looked at the interfaces of
Similarity and DefaultSimilairty class. I place it below:

public float lengthNorm(String fieldName, int numTerms) {
   return (float)(1.0 / Math.sqrt(numTerms));
}

You can see that this lengthNorm for a doc is quite different from that
website norm calculation.

The lengthNorm method is different from the IDF calculation. In the Similarity class, that is handled by the idf() method. Length norm is an attempt to address one of the limitations listed further down in that paper: "Long Documents: Very long documents make similarity measures difficult (vectors with small dot products and high dimensionality)"





Similarly, the querynorm interface of DefaultSimilarity class is:

/** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
 public float queryNorm(float sumOfSquaredWeights) {
   return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
 }

This is again different the website model.

Query norm is an attempt to allow for comparison of scores across queries, but I don't think one should do that anyway.




I also have difficulities with tf interface of DefaultSimilarity:
/** Implemented as <code>sqrt(freq)</code>. */
 public float tf(float freq) {
   return (float)Math.sqrt(freq);
 }


These are all callback methods from within the Scorer classes that each Query uses. Have a look at TermScorer for how these things get called.


Try this as an example:

Setup a really simple index with 1 or 2 docs each with a few words. Setup a simple Similarity class where you override all of these methods to return 1 (or some simple default)
and then index your documents and do a few queries.

Then, have a look at Searcher.explain() to see why a document scores the way it does. Then, you can work to modify from there.

Here's the bigger question: what is your ultimate goal here? Are you just trying to understand Lucene at an academic/programming level or do you have something you are trying to achieve in terms of relevance?

-Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to