Hello Doron (and everyone else reading here) :), thank you for your effort and your time. I really appreciate it. :)
I understand why normalisation is done in general: mainly to normalise away the bias of oversized documents. In the literature I have read so far, a lot of effort usually goes into document normalisation (simply because documents may be very different in size), and usually no normalisation is applied to queries (simply because they are usually only a few words anyway).

For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with

    norm_q : sqrt(sum_t((tf_q*idf_t)^2))

which is also called cosine normalisation. This is a rather comprehensive technique that is usually applied to documents only(!) in all systems I have seen so far.

For the documents, Lucene employs its norm_d_t, which is explained as:

    norm_d_t : square root of number of tokens in d in the same field as t

so basically just the square root of the number of unique terms in the document (since I always search over all fields). I would have expected cosine normalisation here...

The paper you provided uses document normalisation in the following way:

    norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))

I am not sure how this relates to norm_d_t. Did I misunderstand the Lucene formula, or am I misinterpreting something? I have enclosed the formula as a graphic (from LaTeX) so you can have a look, should this be the case...

I will post the other questions separately, since they also relate to the new Lucene scoring algorithm (they have not changed).

Thank you for your time again :)
Karl

-------- Original Message --------
Date: Mon, 11 Dec 2006 22:41:56 -0800
From: Doron Cohen <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Subject: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

> > Well it doesn't since there is no justification of why it is the way it is. It's like saying, here is that car with 5 wheels... enjoy driving.
> >
> > > - I think the explanations there would also answer at least some of your questions.
>
> I hoped it would answer *some* of the questions... (not all)
>
> Let me try some more :-)
>
> > 2a) Why does Lucene normalise with Cosine Normalisation for the query? In a range of different IR system variations (as shown in Chisholm and Kolda's paper in the table on page 8) queries were not normalised at all. Is there a good reason or perhaps any empirical evidence that would support this decision?
>
> I think "why normalizing at all" is answered in
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
> ... "queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. This is a search time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity is: queryNorm(q) = 1/sqrt(sumOfSquaredWeights)"
>
> I think there were discussions on this normalization. Anyhow I am not the expert to justify this.
>
> > 2b) What is the motivation for the normalisation imposed on the documents (norm_d_t), which I have not seen before in any other system? Again, does anybody have pointers to literature on this?
>
> If I understand what you are asking, the logic is that verrry long documents could virtually contain the entire collection, and therefore should be "punished" for being too long, otherwise they would have an "unfair advantage" over short documents. Google "search engine document length normalization" for more on this, or see also
> http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf
>
> > Any answer or partial answer on any of the questions would be greatly appreciated!
>
> I hope this (although partial) helps at all,
> Doron
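
To make the comparison in this thread a bit more concrete, here is a minimal Python sketch (not Lucene code) that puts the three normalisation factors mentioned above side by side: the query norm, the token-count norm_d_t, and the formula from the paper. The function names, the `slope` parameter, the example weights and the raw scores are all made up for illustration; only the formulas come from the messages above. The sketch also assumes that norm_q and norm_d_t are used as divisors in the score, so it returns their reciprocals, as the quoted queryNorm definition does.

```python
import math


def query_norm(term_weights):
    """1 / sqrt(sum of squared tf_q*idf_t weights), as in the quoted queryNorm."""
    return 1.0 / math.sqrt(sum(w * w for w in term_weights))


def length_norm(num_tokens):
    """Reciprocal of norm_d_t: 1 / sqrt(number of tokens in the field)."""
    return 1.0 / math.sqrt(num_tokens)


def paper_norm(avg_doc_length, unique_terms, slope=0.2):
    """1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d)) for slope=0.2."""
    return 1.0 / math.sqrt((1.0 - slope) * avg_doc_length + slope * unique_terms)


def ranking(scores):
    """Document ids ordered by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical query with two terms; each weight stands for tf_q * idf_t.
q_norm = query_norm([1.0 * 2.0, 1.0 * 1.5])

# Hypothetical unnormalised scores for three documents.
raw_scores = {"doc_a": 2.4, "doc_b": 0.9, "doc_c": 1.7}
normalised_scores = {doc: s * q_norm for doc, s in raw_scores.items()}

# Every score is multiplied by the same positive factor, so the order is unchanged.
assert ranking(raw_scores) == ranking(normalised_scores)

# A long document is damped more strongly than a short one.
print(length_norm(100), length_norm(10000))

# The paper's formula mixes the collection average with the document's own size.
print(paper_norm(avg_doc_length=100, unique_terms=400))
```

Under these assumptions the query norm only rescales the scores of one query, which matches the quoted statement that it does not affect document ranking, whereas the two document-side factors differ per document and therefore can change the order.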