Re: Vector Space Model: New Similarity Implementation Issues

Dharmalingam Thu, 28 Feb 2008 12:56:07 -0800

Thanks for your tips. My overall goal is to quickly implement seven variants
of the vector space model using Lucene. You can find these variants in the
uploaded file (linked at the end of this message).


I am doing all this for a much broader goal: I am trying to recover
traceability links from requirements to source code files. I treat every
requirement as a query. For this problem, I would like to compare the
relevance of the results that this collection of algorithms produces.




Grant Ingersoll-6 wrote:
> 
> 
> On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:
> 
>>
>> Thanks for the reply. Sorry if my explanation was not clear. Yes, you are
>> correct: the model is based on Salton's VSM. However, the calculation of
>> the term weight and the doc norm is, in my opinion, different from
>> Lucene's. If you look at the table given in
>> http://www.miislita.com/term-vector/term-vector-3.html, they calculate the
>> document norm based on the weight wi=tfi*idfi. I looked at the interfaces
>> of the Similarity and DefaultSimilarity classes. I place lengthNorm below:
>>
>> public float lengthNorm(String fieldName, int numTerms) {
>>    return (float)(1.0 / Math.sqrt(numTerms));
>> }
>>
>> You can see that this lengthNorm for a doc is quite different from the
>> norm calculation on that website.
> 
> The lengthNorm method is different from the IDF calculation.  In the  
> Similarity class, that is handled by the idf() method.  Length norm is  
> an attempt to address one of the limitations listed further down in  
> that paper:
> "Long Documents: Very long documents make similarity measures  
> difficult (vectors with small dot products and high dimensionality)"
> 
> 
> 
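To make the difference concrete, here is a minimal sketch in plain Java, with no Lucene dependency. It computes the length norm with the formula from the DefaultSimilarity snippet quoted above, and the Euclidean length of the weight vector with w_i = tf_i * idf_i, which is what I assume the website's table means by the document norm. The class name, method names, and sample numbers are made up for illustration.

```java
/** Illustrative only: compares two document normalizations. Not Lucene code. */
final class NormComparison {

    /** Same formula as the quoted DefaultSimilarity.lengthNorm: 1/sqrt(numTerms). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /**
     * Euclidean length of the weight vector, with w_i = tf_i * idf_i.
     * Assumption: this is the "document norm" the website's table refers to.
     */
    static double tfIdfVectorLength(double[] tf, double[] idf) {
        if (tf.length != idf.length) {
            throw new IllegalArgumentException("tf and idf must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < tf.length; i++) {
            double weight = tf[i] * idf[i];
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Made-up numbers: two distinct terms, four term occurrences in total.
        double[] tf = {3.0, 4.0};
        double[] idf = {1.0, 1.0};
        System.out.println("lengthNorm(4)      = " + lengthNorm(4));
        System.out.println("tf*idf vector norm = " + tfIdfVectorLength(tf, idf));
    }
}
```

The length norm depends only on how many terms the field contains, while the tf*idf vector length changes with every term weight, so the two numbers are not expected to agree.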
>>
>>
>> Similarly, the queryNorm method of the DefaultSimilarity class is:
>>
>> /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
>>  public float queryNorm(float sumOfSquaredWeights) {
>>    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
>>  }
>>
>> This is again different from the website model.
> 
> Query norm is an attempt to allow for comparison of scores across  
> queries, but I don't think one should do that anyway.
> 
> 
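A small sketch of what this factor does numerically, in plain Java. It reuses the 1/sqrt(sumOfSquaredWeights) formula quoted above and applies it to made-up query term weights. Which weights Lucene actually feeds into sumOfSquaredWeights is decided in the Query and Weight classes and is not modelled here; the class and helper names are hypothetical.

```java
/** Illustrative only: applies the quoted queryNorm formula to sample weights. */
final class QueryNormSketch {

    /** Same formula as the quoted DefaultSimilarity.queryNorm. */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /** Hypothetical helper: sum of w_i^2 over the query term weights. */
    static float sumOfSquares(float[] weights) {
        float sum = 0f;
        for (float w : weights) {
            sum += w * w;
        }
        return sum;
    }

    /** Multiplies every weight by queryNorm, giving a unit-length vector. */
    static float[] normalize(float[] weights) {
        float norm = queryNorm(sumOfSquares(weights));
        float[] result = new float[weights.length];
        for (int i = 0; i < weights.length; i++) {
            result[i] = weights[i] * norm;
        }
        return result;
    }

    public static void main(String[] args) {
        float[] weights = {3f, 4f}; // made-up query term weights
        float[] unit = normalize(weights);
        System.out.println("queryNorm = " + queryNorm(sumOfSquares(weights)));
        System.out.println("length after normalizing = " + Math.sqrt(sumOfSquares(unit)));
    }
}
```

Under these assumptions, multiplying by queryNorm is the same as dividing by the Euclidean length of the query vector, which is the query-side denominator in the usual cosine formula. It is the same for every document, so it does not change the ranking within one query.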
>>
>>
>> I also have difficulties with the tf method of DefaultSimilarity:
>> /** Implemented as <code>sqrt(freq)</code>. */
>>  public float tf(float freq) {
>>    return (float)Math.sqrt(freq);
>>  }
>>
> 
> These are all callback methods from within the Scorer classes that  
> each Query uses.  Have a look at TermScorer for how these things get  
> called.
> 
> 
> Try this as an example:
> 
> Set up a really simple index with 1 or 2 docs, each with a few words.
> Set up a simple Similarity class where you override all of these
> methods to return 1 (or some simple default),
> and then index your documents and do a few queries.
> 
> Then, have a look at Searcher.explain() to see why a document scores  
> the way it does.  Then, you can work to modify from there.
> 
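The experiment suggested above can be rehearsed without Lucene. The sketch below is a stand-in, not Lucene's Similarity or Scorer classes: the callback class, the in-memory "index" of token lists, and the simplified score (queryNorm times the sum over matching terms of tf * idf^2 * lengthNorm, with boosts and coord left out) are all assumptions made for illustration. It only shows why returning 1 everywhere makes the effect of each callback easy to isolate.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative stand-in for the experiment; none of these types are Lucene classes. */
final class ConstantSimilarityExperiment {

    /** Hypothetical callback set in which every factor is the constant 1. */
    static class ConstantCallbacks {
        float tf(float freq) { return 1f; }
        float idf(int docFreq, int numDocs) { return 1f; }
        float lengthNorm(int numTerms) { return 1f; }
        float queryNorm(float sumOfSquaredWeights) { return 1f; }
    }

    /** Same as above, except that tf returns the raw term frequency. */
    static class RawTfCallbacks extends ConstantCallbacks {
        @Override
        float tf(float freq) { return freq; }
    }

    /** Simplified score of one document; boosts and coord are deliberately omitted. */
    static float score(ConstantCallbacks sim, List<String> query,
                       List<List<String>> docs, int docIndex) {
        List<String> doc = docs.get(docIndex);
        float sumOfSquaredWeights = 0f;
        float sum = 0f;
        for (String term : query) {
            // Document frequency: number of documents containing the term.
            int docFreq = 0;
            for (List<String> other : docs) {
                if (other.contains(term)) {
                    docFreq++;
                }
            }
            float idf = sim.idf(docFreq, docs.size());
            sumOfSquaredWeights += idf * idf;
            int freq = Collections.frequency(doc, term);
            if (freq > 0) {
                sum += sim.tf(freq) * idf * idf * sim.lengthNorm(doc.size());
            }
        }
        return sim.queryNorm(sumOfSquaredWeights) * sum;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "scoring", "lucene"),
                Arrays.asList("vector", "space", "model"));
        List<String> query = Arrays.asList("lucene", "scoring");
        System.out.println("all callbacks = 1: " + score(new ConstantCallbacks(), query, docs, 0));
        System.out.println("raw tf only:       " + score(new RawTfCallbacks(), query, docs, 0));
    }
}
```

With every callback returning 1, the score is just the number of query terms found in the document. Changing one callback at a time, as RawTfCallbacks does for tf, then shows exactly how that factor moves the score.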
> Here's the bigger question:  what is your ultimate goal here?  Are you  
> just trying to understand Lucene at an academic/programming level or  
> do you have something you are trying to achieve in terms of relevance?
> 
> -Grant
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 
http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf 
-- 
View this message in context: 
http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15745822.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


