Re: Vector Space Model: New Similarity Implementation Issues

Grant Ingersoll Thu, 28 Feb 2008 09:45:09 -0800


On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:

Thanks for the reply. Sorry if my explanation is not clear. Yes, you are correct: the model is based on Salton's VSM. However, the calculation of the term weight and the doc norm is, in my opinion, different from Lucene. If you look at the table given in
http://www.miislita.com/term-vector/term-vector-3.html, they calculate the document norm based on the weight wi=tfi*idfi. I looked at the interfaces of the
Similarity and DefaultSimilarity classes. I place them below:

public float lengthNorm(String fieldName, int numTerms) {
   return (float)(1.0 / Math.sqrt(numTerms));
}
You can see that this lengthNorm for a doc is quite different from that
website's norm calculation.

The lengthNorm method is different from the IDF calculation. In the Similarity class, that is handled by the idf() method. Length norm is an attempt to address one of the limitations listed further down in that paper: "Long Documents: Very long documents make similarity measures difficult (vectors with small dot products and high dimensionality)"
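
To make the difference concrete, here is a minimal sketch in plain Java (no Lucene dependency) that puts the two normalizations side by side. The class name, method names, and sample numbers are made up for illustration, and the cosine-style norm assumes the tutorial uses the usual Euclidean length of the weight vector, 1/sqrt(sum of wi^2) with wi=tfi*idfi.

```java
// Illustrative sketch only: NormComparison and its methods are hypothetical names,
// not part of Lucene.
class NormComparison {

    // Same formula as the lengthNorm() quoted above: depends only on the term count.
    static float luceneLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Cosine-style normalization: 1 / sqrt(sum of w_i^2), with w_i = tf_i * idf_i.
    // Assumes tf and idf have the same length (one entry per distinct term).
    static double cosineNorm(double[] tf, double[] idf) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < tf.length; i++) {
            double w = tf[i] * idf[i];
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        double[] tf = {2, 1, 1};         // made-up term frequencies (4 tokens in total)
        double[] idf = {1.0, 2.0, 2.0};  // made-up idf values
        System.out.println("lengthNorm = " + luceneLengthNorm(4));
        System.out.println("cosineNorm = " + cosineNorm(tf, idf));
    }
}
```

The first value depends only on how many terms the field contains, while the second changes whenever a tf or idf value changes, which is why the two calculations do not line up.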



Similarly, the queryNorm method of the DefaultSimilarity class is:

/** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}

This is again different from the website model.

Query norm is an attempt to allow for comparison of scores across queries, but I don't think one should do that anyway.
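
As a rough illustration of why query norm matters little within a single query: it is one factor applied to every document's score for that query, so it rescales the scores without reordering them. The sketch below is plain Java with made-up scores and hypothetical names; only the queryNorm formula is taken from the snippet quoted above.

```java
// Illustrative sketch only: QueryNormDemo and its helpers are hypothetical names,
// not part of Lucene.
class QueryNormDemo {

    // Same formula as the queryNorm() quoted above.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Multiplies every raw score by the same per-query factor.
    static float[] scale(float[] rawScores, float factor) {
        float[] scaled = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            scaled[i] = rawScores[i] * factor;
        }
        return scaled;
    }

    // Returns the position of the highest score (first one wins on ties).
    static int indexOfBest(float[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[] raw = {0.4f, 1.7f, 0.9f};  // made-up unnormalized scores for three docs
        float[] normalized = scale(raw, queryNorm(4.0f));
        System.out.println("best before: " + indexOfBest(raw));
        System.out.println("best after:  " + indexOfBest(normalized));
    }
}
```

Both lines report the same document, since multiplying by a positive constant preserves the ordering.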



I also have difficulties with the tf interface of DefaultSimilarity:
/** Implemented as <code>sqrt(freq)</code>. */
public float tf(float freq) {
    return (float)Math.sqrt(freq);
}

These are all callback methods from within the Scorer classes that each Query uses. Have a look at TermScorer for how these things get called.



Try this as an example:

Set up a really simple index with 1 or 2 docs, each with a few words. Set up a simple Similarity class where you override all of these methods to return 1 (or some simple default), and then index your documents and do a few queries.

Then, have a look at Searcher.explain() to see why a document scores the way it does. Then, you can work to modify from there.
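
Before trying this against a real index, the idea can be sketched in plain Java. The code below is a stand-in, not Lucene's Similarity API: the Factors interface, both implementations, and the score() helper are hypothetical. It assumes a single-term query with no boosts, that the factors combine as tf * idf^2 * queryNorm * lengthNorm, and that the default idf is log(numDocs/(docFreq+1)) + 1. It also ignores the lossy encoding Lucene applies when it stores norms in the index.

```java
// Stand-in for experimentation only; NOT Lucene's Similarity API.
// All names in this class are hypothetical.
class SimilaritySketch {

    // The callbacks discussed in this thread, reduced to a small interface.
    interface Factors {
        float tf(float freq);
        float idf(int docFreq, int numDocs);
        float lengthNorm(int numTerms);
        float queryNorm(float sumOfSquaredWeights);
    }

    // Mirrors the formulas quoted in this thread; the idf formula is an assumption.
    static class DefaultLike implements Factors {
        public float tf(float freq) { return (float) Math.sqrt(freq); }
        public float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }
        public float lengthNorm(int numTerms) { return (float) (1.0 / Math.sqrt(numTerms)); }
        public float queryNorm(float sumOfSquaredWeights) {
            return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
        }
    }

    // Every factor returns 1, so any remaining score differences come from elsewhere.
    static class Flat implements Factors {
        public float tf(float freq) { return 1.0f; }
        public float idf(int docFreq, int numDocs) { return 1.0f; }
        public float lengthNorm(int numTerms) { return 1.0f; }
        public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
    }

    // Score of one document for a single-term query without boosts (assumed form).
    static float score(Factors f, float freq, int docFreq, int numDocs, int numTerms) {
        float idf = f.idf(docFreq, numDocs);
        float queryNorm = f.queryNorm(idf * idf);
        return f.tf(freq) * idf * idf * queryNorm * f.lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Term occurs 4 times in a 16-term field; 1 of 2 docs contains it.
        System.out.println("default-like: " + score(new DefaultLike(), 4f, 1, 2, 16));
        System.out.println("flat:         " + score(new Flat(), 4f, 1, 2, 16));
    }
}
```

With every factor pinned to 1, each factor can be switched back on one at a time to see its effect in isolation, which is the same breakdown Searcher.explain() reports for a real index.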

Here's the bigger question: what is your ultimate goal here? Are you just trying to understand Lucene at an academic/programming level, or do you have something you are trying to achieve in terms of relevance?


-Grant
