(1) The FAQ states the following: "For the record, Lucene's scoring algorithm is, roughly:

  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

where:

  score_d   : score for document d
  sum_t     : sum for all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs/docFreq_t+1) + 1.0
  numDocs   : number of documents in index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of number of tokens in d in the same field as t
  boost_t   : the user-specified boost for term t
  coord_q_d : number of terms in both query and document / number of terms in query"

Is either of the expressions below the correct parenthesization of the expression above? If not, what is?

  score_d = sum_t(tf_q * (idf_t / norm_q) * tf_d * (idf_t / norm_d_t) * boost_t) * coord_q_d

  score_d = sum_t((tf_q * idf_t) / (norm_q * tf_d * idf_t) / (norm_d_t * boost_t)) * coord_q_d

(2) I'm trying to make sure that I have a handle on the semantics of BooleanQuery, especially as they relate to the scoring mechanism. I would appreciate it if someone would correct any misapprehensions in the following descriptions.

  * BooleanQuery.add(query, false, false) is equivalent to Boolean OR. Only one such query must be satisfied in a given document D (via the appearance of the associated term(s)/phrase(s) in D) in order for the score for D to be > 0.

  * BooleanQuery.add(query, true, false) is equivalent to Boolean AND. All such queries must be satisfied (as above) in D in order for score(D) to be > 0.

  * BooleanQuery.add(query, false, true) is equivalent to Boolean NAND. All such queries must be *not* satisfied in D in order for score(D) to be > 0.

If these, and the semantics for "required" and "prohibited" (in BooleanQuery.add()), are accurate, then the semantics seem rather odd to me, so I'm hoping that someone will tell me that I'm wrong.
:)

In particular, it seems to me that if you create a BooleanQuery and add a single TermQuery tq to it with add(tq, false, false) then, according to the semantics of "required" and "prohibited", *any* document will match the query...which clearly doesn't make sense.

(3) Somewhat unrelated question: what are the semantics and purpose of FilteredTermEnum.difference()? (I see where and how it's used in the source but I don't understand the motivation.)

(4) I'm still somewhat puzzled by MultiTermQuery. I believe that Lucene essentially uses the vector model to identify (and quantify) query-document matching. A vector model query consists of one or more terms, so I would expect MultiTermQuery to be a common query type. However, the documentation doesn't make it clear how one binds several terms together into a MultiTermQuery, and in particular I don't see how one could separately set the boost for different terms in a MultiTermQuery. In my current prototype system, I've been using BooleanQuery to bind my terms together into a single query, and I'm not sure that the semantics for BooleanQuery is what I want (hence question (2) above). Could someone please explain what MultiTermQuery is for, how it should be used, etc.?

Thanks--

Joshua O'Madadhain

[EMAIL PROTECTED]
Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
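
To make question (1) concrete, here is a minimal sketch in plain Java (no Lucene classes) that evaluates the per-term product three ways: as quoted, and under each of the two candidate parenthesizations. It only shows what the quoted expression means under ordinary arithmetic precedence, where * and / bind equally and group left to right; it does not confirm what Lucene itself computes. The class name, method names, and sample numbers are hypothetical, and reading "numDocs/docFreq_t+1" as numDocs / (docFreq_t + 1) is an assumption.

```java
public class ScoringSketch {

    /** idf_t, ASSUMING "numDocs/docFreq_t+1" means numDocs / (docFreq_t + 1). */
    public static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Per-term product as quoted; Java evaluates * and / left to right. */
    public static double asQuoted(double tfQ, double idfT, double normQ,
                                  double tfD, double normDT, double boostT) {
        return tfQ * idfT / normQ * tfD * idfT / normDT * boostT;
    }

    /** First candidate: tf_q * (idf_t / norm_q) * tf_d * (idf_t / norm_d_t) * boost_t. */
    public static double candidateOne(double tfQ, double idfT, double normQ,
                                      double tfD, double normDT, double boostT) {
        return tfQ * (idfT / normQ) * tfD * (idfT / normDT) * boostT;
    }

    /** Second candidate: (tf_q * idf_t) / (norm_q * tf_d * idf_t) / (norm_d_t * boost_t). */
    public static double candidateTwo(double tfQ, double idfT, double normQ,
                                      double tfD, double normDT, double boostT) {
        return (tfQ * idfT) / (normQ * tfD * idfT) / (normDT * boostT);
    }

    public static void main(String[] args) {
        // Hypothetical sample values, chosen only for illustration.
        double idfT = idf(1000, 9);
        double tfQ = 1.0, normQ = 2.0, tfD = 3.0, normDT = 4.0, boostT = 1.5;

        System.out.println("as quoted:     " + asQuoted(tfQ, idfT, normQ, tfD, normDT, boostT));
        System.out.println("candidate one: " + candidateOne(tfQ, idfT, normQ, tfD, normDT, boostT));
        System.out.println("candidate two: " + candidateTwo(tfQ, idfT, normQ, tfD, normDT, boostT));
    }
}
```

Under that reading, the first candidate is algebraically the same as the quoted expression (up to floating-point rounding), while the second moves tf_d, the second idf_t, and boost_t into the denominator and so gives a different value.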