Re: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

Karl Koch Tue, 12 Dec 2006 01:00:29 -0800

Hello Doron (and all the others who read here):),

thank you for your effort and your time. I really appreciate it. :)


I understand why normalisation is done in general. Mainly, to normalise the 
bias of oversized documents. In the literature I have read so far, there is 
usually a high effort on document normalisation (simply because they may be 
very different in size), and usually no normalisation on queries (simply 
because they are usually only a few words anyway). 

For this reason, I do not understand why Lucene (in version 1.2) normalises the 
query(!) with 

norm_q : sqrt(sum_t((tf_q*idf_t)^2)) 

which is also called cosine normalisation. This is a technique that is rather 
comprehensive and usually used for docuemnts only(!) in all systems I have seen 
so far.  For the documents Lucene employs its norm_d_t which is explained as:

norm_d_t : square root of number of tokens in d in the same field as t

basically just the square root of the number of unique terms in the document 
(since I do search over all fields always). I would have expected cosine 
normalisation here... 

The paper you provided uses document normalisation in the following way:

norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))

I am not sure how this relates to norm_d_t. Did I misunderstood the Lucene 
formula or do I misinterpret something? I enclosed the formula as a graphic 
(from LaTeX) so you can have a look should this be the case here...

I will post the other questions separately since they actually also relate to 
the new Lucene scoring algoritm (they have not changed). Thank you for your 
time again :)

Karl


-------- Original-Nachricht --------
Datum:  Mon, 11 Dec 2006 22:41:56 -0800
Von: Doron Cohen <[EMAIL PROTECTED]>
An: [email protected]
Betreff:  Re:  Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring 
formula needed)

> > Well it doesn't since there is not justification of why it is the
> > way it is. Its like saying, here is that car with 5 weels... enjoy
> driving.
> 
> > >  - I think the explanations there would also answer at least some of
> your
> > > questions.
> 
> I hoped it would answer *some* of the questions... (not all)
> 
> Let me try some more :-)
> 
> > 2a) Why does Lucene normalise with Cosine Normalisation for the
> > query? In a range of different IR system variations (as shown in
> > Chisholm and Kolda's paper in the table on page 8) queries where not
> > normalised at all. Is there a good reason or perhaps any empirical
> > evidence that would support this decision?
> 
> I think "why normalizing at all" is answered in
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
> ... "queryNorm(q) is a normalizing factor used to make scores between
> queries comparable. This factor does not affect document ranking (since
> all
> ranked documents are multiplied by the same factor), but rather just
> attempts to make scores from different queries (or even different indexes)
> comparable. This is a search time factor computed by the Similarity in
> effect at search time. The default computation in DefaultSimilarity is:
> queryNorm(q) = 1/sqrt(sumOfSquaredWeights)"
> 
> I think there were discussions on this normalization. Anyhow I am not the
> expert to justify this.
> 
> > 2b) What is the motivation for the normalisation imposed on the
> > documents (norm_d_t) which I have not seen before in any other
> > system. Again, does anybody have pointers to literature on this?
> 
> If I understand what you are asking, the logic is that verrry long
> documents could virtually contain the entire collection, and therefore
> should be "punished" for being too long, otherwise they would have an
> "unfair advantage" upon short documents. Google "search engine document
> length normalization" for more on this, or see also
> http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf
> 
> > Any answer or partial answer on any of the questions would be
> > greatly appreciated!
> 
> I hope this (although partial) helps at all,
> Doron
-- 
"Ein Herz für Kinder" - Ihre Spende hilft! Aktion: www.deutschlandsegelt.de
Unser Dankeschön: Ihr Name auf dem Segel der 1. deutschen America's Cup-Yacht!

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

Reply via email to