Thanks for your tips. My overall goal is to quickly implement seven variants of the vector space model using Lucene. You can find these variants in the uploaded file.
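For concreteness, the tf*idf weighting and document norm from the miislita table mentioned in the quoted message below can be sketched in plain Java, independent of Lucene. This is only an illustration under stated assumptions: the class name TfIdfCosine, the idf formula log10(N/df), and the toy corpus are hypothetical and are not taken from the uploaded file or from Lucene's Similarity API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: classic tf*idf weights with cosine similarity.
class TfIdfCosine {

    // Raw term frequencies for one tokenized text.
    static Map<String, Integer> termFreqs(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // Assumed idf variant: log10(N / df). Terms absent from the corpus get 0.
    static double idf(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df == 0 ? 0.0 : Math.log10((double) corpus.size() / df);
    }

    // Weight vector with w_i = tf_i * idf_i for every term in the text.
    static Map<String, Double> weights(List<String> tokens, List<List<String>> corpus) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs(tokens).entrySet()) {
            w.put(e.getKey(), e.getValue() * idf(e.getKey(), corpus));
        }
        return w;
    }

    // Vector norm: sqrt of the sum of squared weights.
    static double norm(Map<String, Double> w) {
        double sum = 0.0;
        for (double v : w.values()) {
            sum += v * v;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot product divided by the product of the two norms.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    public static void main(String[] args) {
        // Toy corpus standing in for tokenized source files (an assumption).
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("login", "user", "password"),
            Arrays.asList("parse", "file", "user"),
            Arrays.asList("render", "page"));
        // A requirement treated as a query.
        Map<String, Double> query = weights(Arrays.asList("user", "login"), corpus);
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + cosine(query, weights(doc, corpus)));
        }
    }
}
```

Here the norm depends on the tf*idf weights themselves, whereas the lengthNorm quoted below depends only on the number of terms in a field, which is the difference in question.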
I am doing all this for a much broader goal: I am trying to recover traceability links from requirements to source code files. I treat every requirement as a query. For this problem, I would like to compare this collection of algorithms in terms of relevance.

Grant Ingersoll-6 wrote:
>
> On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:
>
>> Thanks for the reply. Sorry if my explanation is not clear. Yes, you are
>> correct, the model is based on Salton's VSM. However, the calculation of the
>> term weight and the doc norm is, in my opinion, different from Lucene. If
>> you look at the table given in
>> http://www.miislita.com/term-vector/term-vector-3.html, they calculate the
>> document norm based on the weight wi=tfi*idfi. I looked at the interfaces of
>> the Similarity and DefaultSimilarity classes. I quote one below:
>>
>>   public float lengthNorm(String fieldName, int numTerms) {
>>     return (float)(1.0 / Math.sqrt(numTerms));
>>   }
>>
>> You can see that this lengthNorm for a doc is quite different from that
>> website's norm calculation.
>
> The lengthNorm method is different from the IDF calculation. In the
> Similarity class, that is handled by the idf() method. Length norm is
> an attempt to address one of the limitations listed further down in
> that paper:
> "Long Documents: Very long documents make similarity measures
> difficult (vectors with small dot products and high dimensionality)"
>
>> Similarly, the queryNorm interface of the DefaultSimilarity class is:
>>
>>   /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
>>   public float queryNorm(float sumOfSquaredWeights) {
>>     return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
>>   }
>>
>> This is again different from the website model.
>
> Query norm is an attempt to allow for comparison of scores across
> queries, but I don't think one should do that anyway.
>
>> I also have difficulties with the tf interface of DefaultSimilarity:
>>
>>   /** Implemented as <code>sqrt(freq)</code>. */
>>   public float tf(float freq) {
>>     return (float)Math.sqrt(freq);
>>   }
>
> These are all callback methods from within the Scorer classes that
> each Query uses. Have a look at TermScorer for how these things get
> called.
>
> Try this as an example:
>
> Set up a really simple index with 1 or 2 docs, each with a few words.
> Set up a simple Similarity class where you override all of these
> methods to return 1 (or some simple default),
> and then index your documents and do a few queries.
>
> Then, have a look at Searcher.explain() to see why a document scores
> the way it does. Then, you can work to modify from there.
>
> Here's the bigger question: what is your ultimate goal here? Are you
> just trying to understand Lucene at an academic/programming level or
> do you have something you are trying to achieve in terms of relevance?
>
> -Grant
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf
--
View this message in context: http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15745822.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.