Hi,

I'm not sure anyone can fully address all your questions, but I thought I would point you at http://lucene.apache.org/java/docs/ scoring.html if you haven't already started there. It has some details on scoring, as well as pointers into lower level details.

Some comments are also inline below, which are just a few thoughts on some of your questions, but nothing in-depth like I think you are looking for. I am not able to answer in more detail at this time, but can give you some pointers on where to look.

Hope it helps.

-Grant

On Apr 2, 2007, at 3:35 PM, [EMAIL PROTECTED] wrote:

I’ve been working on ranking/scoring issues for full text search for
years. When I try to implement different ranking inside the Lucene engine,
I face some problems. I list some of my experience and questions here.

The major task I want to implement inside Lucene is different
ranking/scoring algorithms. I may not find the correct source of
information, but I really cannot find a detailed document describing the relations among ranking/scoring related classes in Lucene on web. “Lucene in Action” concerns mostly about applications level usage but not these
lower-level APIs.

(1) The first thing I tried is the abstract class Scorer. The description
for this class is:
/** Expert: Implements scoring for a class of queries. */

If you looked IndexSearcher class, one possible search process is
implemented inside
public TopDocs search(Query query, Filter filter, final int nDocs). If you look in detail how it is implemented, you will find out it first acquires
such a scorer instance by:
Scorer scorer = query.weight(this).scorer(reader);

The magic happens inside the method call scorer.score(new HitCollector());
What the score method does is something similar to an Iterator. The
scorer.score continuously call scorer.next() to acquire next qualified
document (I guess the qualified documents inside this iterator is from
BitSet operations from the given query. I did not looked into the detail implementation of that), and calls scorer.score() to get the score for the
current document that the iterator pointed at.

What the HitCollector does is merely simple. It only implement a method
called collect(int doc, float score). This method is called every time
when a new document’s score are calculated. In the IndexSearcher.search
method, the documents and their scores are sent to a PriorityQueue and
ranked according to the scores.

(2) My first plan is to modify the scorer.score()method. Unfortunately, I found this is extremely complex. score() is an abstract method which is implemented inside its subclasses like Boolean Scorer, Conjunction Scorer, Phrase Score, … Since I do not need to consider the complex Boolean query syntax (in my experiments, query is defined as a list of terms connected by disjunctions), I implement the score method inside Scorer rather inside
sub classes.

What I did is every time when Scorer.score are called, I pass the current
document number via doc(), and read out the term frequencies from
IndexReader. getTermFreqVector(int docNumber, String field)  method. I
found this operation is super slow. The major cost is spent on the method
getTermFreqVector.

Term Vectors are stored differently from TermDocs, so I am not surprised that they are slower. There probably could be some optimizations made to their storage, but I would guess that most people don't use Term Vecs all that much, so no one has looked too closely at where optimizations could be made. They almost certainly are not using them to loop over all the docs in the system. Most scenarios, I believe are for highlighting results on a doc-by-doc basis or for relevance feedback/more like this. Think of Term Vectors as an after the fact addition to Lucene for convenience purposes, not for scoring/performance.



(3) Later I notice in the implementation in TermScorer, there is a
function call
IndexReader.termDocs.read(docs, freqs);    // refill buffer
And I read the comments for this function in IndexReader, it says

/** Attempts to read multiple entries from the enumeration, up to length of * <i>docs</i>. Document numbers are stored in <i>docs</i>, and term * frequencies are stored in <i>freqs</i>. The <i>freqs</i> array must
be as
   * long as the <i>docs</i> array.
   *
* <p>Returns the number of entries read. Zero is only returned when the
   * stream has been exhausted.  */

So different from IndexReader.getTermFreqVector, which read out term-freq vectors for a document, this function read doc-freq vectors for a term. I find this method call is extremely faster. At least two magnitudes faster than getTermFreqVector, if I want to get all term-freqs for given terms
for all docs. I do not know the reason why there is such difference, I
cannot find a document describing the implementation difference and
purpose among these two.

(4) About extending Lucene’s scoring function.
If we want to implement any arbitrary ranking/scoring for term- frequency
based algorithms, we need the scoring fits the following framework

Any term-frequency scoring can be expressed in the following ways (let’s simplify that there is only one field so that we can ignore boost factor
at this time):

Sigma (t in q) [TermWeight(t in d)*TermWeight(t in q) / lengthNorm (d) /
lengthNorm(q)]

Since lengthNorm(q) is the same for each document, it will not affect
ranking, we just ignore it. We further separate term weight into three
parts, term weight related to document, term weight related to query, term weight related to corpus (not related to document or query) and document
length norm.

Sigma (t in q) [TW(t in d)*TW(t in q) * TW (t) / lengthNorm(d)]

We can notice this matches Lucene’ ranking function

Sigma (t in q) [tf(t in d)*1*idf (t) / lengthNorm(d)]. To save
computational cost, lengthNorm(d) is pre-calculated when indexing the
corpus. So Lucene’ lengthNorm(d) does not involve any corpus statistics into calculation, it is merely the # of terms inside this document. On the other hand, it treats the term weight in query is the same as 1. That is Lucene does not differentiate terms important inside query. If we want to
emphasis a term twice we need to put two terms inside the query. For
example, instead of search “boat” we put “boat boat”, do I understand
correctly?

So, generally, if we can update the lengthNorm(d) inside the indexing code
or via post-processing after all documents are indexed, Lucene can
implement any arbitrary ranking function such as BM25. A pity is that it
does not directly support query term weight in the query language.


We like patches and will at least consider and discuss most any patch that is well thought out, backward compatible and tested and where the author makes a good case for it.

(5) My final big question would be how Lucene really implement
ranking/scoring. We could notice there are two possible strategies. Each
of them will result in different flexibility if we need modify current
ranking algorithms.

The first strategy is Lucene generate document scores in a document by
document manner. In Scorer.score, we notice the framework is each time the
scorer meet a new document, Lucene will generate a score for this
document. This framework is simple and intuitive. And all we discussed in
(4) will fit this framework. When you processing the current document,
lengthNorm(d) can be read, even if TW(t in d) has relation to
lengthNorm(d) we can calculate that accordingly. Unfortunately, I cannot find any low-level code could be thinked related to this implementation
strategy. This makes me think the Scorer.score method is not the real
place Lucene implement its ranking. I am pretty confused about this part.
Who can help me with this?


You might take a look at how Query/Weight/Scorers are implemented. For instance, I just added a BoostingTermQuery to Java version that implements this trifecta.

The second strategy is Lucene generate document scores in a term by term manner. For each term inside the query, Lucene calls termDocs.read (docs, freqs) to accumulate scores over the specific term dimension over all the
documents.  Under this framework (which I feel is current Lucene’s
implementation), what we discussed in (4) is not held. We need extend
current Lucene’s tf(float freq) function to tf(float freq, float norm), so
that arbitrary ranking could be implemented.

There is some discussion of flexible indexing approaches on Lucene Java, which also, to me anyway, implies flexible scoring. You might search that archive for "flexible indexing" to see if anything strikes you.


(6) I have implemented a search module outside Lucene IndexSearcher which
could implement any arbitrary ranking over vector query (a list of
disjunctive terms). The term-by-term implementation is much faster than my previous document-by-document implementation. But I currently still cannot encode the ranking module under Lucene’ Boolean engines, that is only rank
the documents retrieved (only rank the documents satisfy the condition
specified by the query).

__________________________________________________________________

I do not know whether I described my points clear, it is complex and hard to write in a short plain text message. Maybe I should post a PDF version tech-report online so that the problem is stated clearer. I am not sure my
understanding is correct. Thanks for any help and comments.


Xiangyu Jin
Department of Computer Science
University of Virginia
Charlottesville, VA 22903




--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ LuceneFAQ


Reply via email to