Qs re: document scoring and semantics

Joshua O'Madadhain Sat, 16 Feb 2002 23:35:55 -0800

(1) The FAQ states the following:

"For the record, Lucene's scoring algorithm is, roughly:


  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d
 
where:
  score_d   : score for document d
  sum_t     : sum for all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs/docFreq_t+1) + 1.0
  numDocs   : number of documents in index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of number of tokens in d in the same field as t
  boost_t    : the user-specified boost for term t
  coord_q_d  : number of terms in both query and document / number of
terms in query"

Is either of the expressions below the correct parenthesization of the
expression above?  If not, what is?

score_d = sum_t(tf_q * (idf_t / norm_q) * tf_d * (idf_t / norm_d_t) *
boost_t) * coord_q_d

score_d = sum_t((tf_q * idf_t) / (norm_q * tf_d * idf_t) / (norm_d_t *
boost_t)) * coord_q_d


(2) I'm trying to make sure that I have a handle on the semantics of
BooleanQuery, especially as they relate to the scoring mechanism.  I would
appreciate it if someone would correct any misapprehensions in the
following descriptions.

* BooleanQuery.add(query, false, false) is equivalent to Boolean OR.  
Only one such query must be satisfied in a given document D (via the
appearance of the associated term(s)/phrase(s) in D) in order for the
score for D to be > 0.

* BooleanQuery.add(query, true, false) is equivalent to Boolean AND.  All
such queries must be satisfied (as above) in D in order for score(D) to
be > 0.

* BooleanQuery.add(query, false, true) is equivalent to Boolean NAND.  
All such queries must be *not* satisfied in D in order for score(D) to be
> 0.

If these, and the semantics for "required" and "prohibited" (in
BooleanQuery.add()), are accurate, then the semantics seem rather odd to
me, so I'm hoping that someone will tell me that I'm wrong.  :) In
particular, it seems to me that if you create a BooleanQuery and add a
single TermQuery tq to it with add(tq, false, false) then, according to
the semantics of "required" and "prohibited", *any* document will match
the query...which clearly doesn't make sense.


(3) Somewhat unrelated question: what are the semantics and purpose of
FilteredTermEnum.difference()?  (I see where and how it's used in the
source but I don't understand the motivation.)


(4) I'm still somewhat puzzled by MultiTermQuery.  I believe that Lucene
essentially uses the vector model to identify (and quantify)
query-document matching.  A vector model query consists of one or more
terms, so I would expect MultiTermQuery to be a common query
type.  However, the documentation doesn't make it clear how one binds
several terms together into a MultiTermQuery, and in particular I don't
see how one could separately set the boost for different terms in a
MultiTermQuery.  In my current prototype system, I've been using
BooleanQuery to bind my terms together into a single query, and I'm not
sure that the semantics for BooleanQuery is what I want (hence question
(2) above).  Could someone please explain what MultiTermQuery is for,  
how it should be used, etc.?

Thanks--

Joshua O'Madadhain

 [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
    Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Qs re: document scoring and semantics

Reply via email to