Re: Deeper Ranking Issues in Lucene

Grant Ingersoll Mon, 02 Apr 2007 13:24:32 -0700

Hi,

I'm not sure anyone can fully address all your questions, but I thought I would point you at http://lucene.apache.org/java/docs/scoring.html if you haven't already started there. It has some details on scoring, as well as pointers into lower level details.

Some comments are also inline below, which are just a few thoughts on some of your questions, but nothing in-depth like I think you are looking for. I am not able to answer in more detail at this time, but can give you some pointers on where to look.


Hope it helps.

-Grant

On Apr 2, 2007, at 3:35 PM, [EMAIL PROTECTED] wrote:

I’ve been working on ranking/scoring issues for full text search for
years. When I try to implement different ranking inside the Lucene engine,
I face some problems. I list some of my experience and questions here.

The major task I want to implement inside Lucene is different
ranking/scoring algorithms. I may not have found the right source of
information, but I really cannot find a detailed document on the web
describing the relations among the ranking/scoring related classes in
Lucene. “Lucene in Action” is mostly concerned with application-level
usage, not these lower-level APIs.
(1) The first thing I tried is the abstract class Scorer. The description
for this class is:
/** Expert: Implements scoring for a class of queries. */

If you look at the IndexSearcher class, one possible search process is
implemented inside
public TopDocs search(Query query, Filter filter, final int nDocs). If you
look in detail at how it is implemented, you will find that it first
acquires such a scorer instance by:
Scorer scorer = query.weight(this).scorer(reader);
The magic happens inside the method call scorer.score(new HitCollector());
What the score method does is similar to an Iterator. The
scorer.score method repeatedly calls scorer.next() to acquire the next
qualified document (I guess the qualified documents inside this iterator
come from BitSet operations on the given query. I did not look into the
detailed implementation of that), and calls scorer.score() to get the score
for the current document that the iterator points at.
What the HitCollector does is quite simple. It only implements a method
called collect(int doc, float score). This method is called every time
a new document’s score is calculated. In the IndexSearcher.search
method, the documents and their scores are sent to a PriorityQueue and
ranked according to the scores.
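
The loop described in (1) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: SimpleScorer, SimpleCollector, Hit and arrayScorer are hypothetical stand-ins invented for the sketch, not Lucene's Scorer, HitCollector or TopDocs, and the toy scorer simply walks two parallel arrays instead of an index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-ins for the roles described above; NOT Lucene's classes.
final class ScoringLoopSketch {

    /** Iterator-like scorer: next() advances, doc() and score() describe the current hit. */
    interface SimpleScorer {
        boolean next();
        int doc();
        float score();
    }

    /** Receives one (doc, score) pair per matching document. */
    interface SimpleCollector {
        void collect(int doc, float score);
    }

    /** One ranked result. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Drives the scorer to exhaustion, handing every hit to the collector. */
    static void scoreAll(SimpleScorer scorer, SimpleCollector collector) {
        while (scorer.next()) {
            collector.collect(scorer.doc(), scorer.score());
        }
    }

    /** Keeps the n best hits in a min-heap and returns them best first. */
    static List<Hit> topDocs(SimpleScorer scorer, int n) {
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        scoreAll(scorer, (doc, score) -> {
            heap.add(new Hit(doc, score));
            if (heap.size() > n) {
                heap.poll(); // evict the current lowest score
            }
        });
        List<Hit> best = new ArrayList<>(heap);
        best.sort((a, b) -> Float.compare(b.score, a.score));
        return best;
    }

    /** Toy scorer over parallel arrays of matching doc ids and their scores. */
    static SimpleScorer arrayScorer(int[] docs, float[] scores) {
        return new SimpleScorer() {
            private int i = -1;
            public boolean next() { return ++i < docs.length; }
            public int doc() { return docs[i]; }
            public float score() { return scores[i]; }
        };
    }

    public static void main(String[] args) {
        SimpleScorer scorer =
            arrayScorer(new int[] {0, 3, 7, 9}, new float[] {0.2f, 1.5f, 0.9f, 1.1f});
        for (Hit hit : topDocs(scorer, 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

The point of the split is that the scorer only knows how to walk matches and score the current one, while the collector decides what to keep; a bounded heap keeps memory proportional to nDocs rather than to the number of matches.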
(2) My first plan was to modify the scorer.score() method. Unfortunately, I
found this extremely complex. score() is an abstract method which is
implemented inside its subclasses like BooleanScorer, ConjunctionScorer,
PhraseScorer, … Since I do not need to consider the complex Boolean query
syntax (in my experiments, a query is defined as a list of terms connected
by disjunctions), I implemented the score method inside Scorer rather than
inside the subclasses.
What I did is, every time Scorer.score is called, I pass the current
document number via doc(), and read out the term frequencies with the
IndexReader.getTermFreqVector(int docNumber, String field) method. I
found this operation is super slow. Most of the cost is spent in the method
getTermFreqVector.

Term Vectors are stored differently from TermDocs, so I am not surprised that they are slower. There probably could be some optimizations made to their storage, but I would guess that most people don't use Term Vecs all that much, so no one has looked too closely at where optimizations could be made. They almost certainly are not using them to loop over all the docs in the system. Most scenarios, I believe, are for highlighting results on a doc-by-doc basis or for relevance feedback/more like this. Think of Term Vectors as an after the fact addition to Lucene for convenience purposes, not for scoring/performance.

(3) Later I notice in the implementation in TermScorer, there is a
function call
IndexReader.termDocs.read(docs, freqs); // refill buffer
And I read the comments for this function in IndexReader, it says
/** Attempts to read multiple entries from the enumeration, up to length of
 * docs. Document numbers are stored in docs, and term
 * frequencies are stored in freqs. The freqs array must be as
 * long as the docs array.
 *
 * Returns the number of entries read. Zero is only returned when the
 * stream has been exhausted. */
So, unlike IndexReader.getTermFreqVector, which reads out the term-freq
vector for a document, this function reads the doc-freq vector for a term. I
find this method call far faster: at least two orders of magnitude faster
than getTermFreqVector, if I want to get all term-freqs for given terms
for all docs. I do not know the reason why there is such a difference, and I
cannot find a document describing the implementation difference and
purpose of these two.
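
The read(docs, freqs) contract quoted in (3) can be illustrated with a small in-memory model. This is a sketch, not Lucene code: ToyPostings and freqsByDoc are hypothetical names, and the postings here are plain arrays rather than anything read from an index. It only shows the calling pattern the comment describes: refill a fixed-size buffer until read returns zero.

```java
import java.util.Arrays;

// Toy model of a bulk postings read; names are hypothetical, not Lucene's API.
final class BulkReadSketch {

    /** In-memory postings list for one term: parallel doc ids and frequencies. */
    static final class ToyPostings {
        private final int[] allDocs;
        private final int[] allFreqs;
        private int pos = 0;

        ToyPostings(int[] allDocs, int[] allFreqs) {
            if (allDocs.length != allFreqs.length) {
                throw new IllegalArgumentException("docs and freqs must have equal length");
            }
            this.allDocs = allDocs;
            this.allFreqs = allFreqs;
        }

        /** Fills docs/freqs with up to docs.length entries; returns 0 only when exhausted. */
        int read(int[] docs, int[] freqs) {
            int n = Math.min(docs.length, allDocs.length - pos);
            System.arraycopy(allDocs, pos, docs, 0, n);
            System.arraycopy(allFreqs, pos, freqs, 0, n);
            pos += n;
            return n;
        }
    }

    /** Gathers one term's frequency for every document by refilling a small buffer. */
    static int[] freqsByDoc(ToyPostings postings, int maxDoc, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        int[] result = new int[maxDoc];
        int[] docs = new int[bufferSize];
        int[] freqs = new int[bufferSize];
        int n;
        while ((n = postings.read(docs, freqs)) > 0) { // refill buffer
            for (int i = 0; i < n; i++) {
                result[docs[i]] = freqs[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ToyPostings postings = new ToyPostings(new int[] {1, 4, 5}, new int[] {2, 1, 3});
        System.out.println(Arrays.toString(freqsByDoc(postings, 6, 2)));
    }
}
```

In this model all frequencies for one term sit next to each other, so collecting them is a single sequential pass. Collecting the same numbers through per-document term vectors would mean one lookup per document, which is consistent with the gap reported above, although the real cause inside Lucene's storage is not established here.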

(4) About extending Lucene’s scoring function.
If we want to implement any arbitrary ranking/scoring for term-frequency
based algorithms, we need the scoring to fit the following framework.
Any term-frequency scoring can be expressed in the following way (let’s
assume there is only one field so that we can ignore the boost factor
for now):
Sigma (t in q) [TermWeight(t in d)*TermWeight(t in q) / lengthNorm(d) / lengthNorm(q)]

Since lengthNorm(q) is the same for each document, it will not affect
ranking, so we just ignore it. We further separate the term weight into
three parts: term weight related to the document, term weight related to
the query, and term weight related to the corpus (not related to document
or query), plus the document length norm.

Sigma (t in q) [TW(t in d)*TW(t in q) * TW (t) / lengthNorm(d)]

We can notice this matches Lucene’s ranking function

Sigma (t in q) [tf(t in d)*1*idf (t) / lengthNorm(d)]. To save
computational cost, lengthNorm(d) is pre-calculated when indexing the
corpus. So Lucene’s lengthNorm(d) does not involve any corpus statistics
in its calculation; it is merely the # of terms inside this document. On the
other hand, it treats the term weight in the query as 1. That is,
Lucene does not differentiate the importance of terms inside the query. If
we want to emphasize a term twice we need to put it in the query twice. For
example, instead of searching “boat” we put “boat boat”. Do I understand
correctly?
So, generally, if we can update the lengthNorm(d) inside the indexing code
or via post-processing after all documents are indexed, Lucene can
implement any arbitrary ranking function such as BM25. A pity is that it
does not directly support query term weight in the query language.
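
The framework in (4) can be written down directly. The sketch below follows the simplified formula from the message, Sigma (t in q) [TW(t in d)*TW(t in q)*TW(t) / lengthNorm(d)], and not Lucene's actual Similarity, which has additional factors; the class name, method name and map-based inputs are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the simplified framework from (4); not Lucene's Similarity.
final class TermWeightFrameworkSketch {

    /**
     * score(d, q) = sum over query terms t of
     *   docWeight(t, d) * queryWeight(t) * corpusWeight(t) / lengthNorm(d),
     * with docWeight taken to be the raw term frequency.
     */
    static double score(Map<String, Integer> docTermFreqs,
                        Map<String, Double> queryWeights,
                        Map<String, Double> corpusWeights,
                        double lengthNorm) {
        double sum = 0.0;
        for (Map.Entry<String, Double> q : queryWeights.entrySet()) {
            int tf = docTermFreqs.getOrDefault(q.getKey(), 0);
            double idf = corpusWeights.getOrDefault(q.getKey(), 0.0);
            sum += tf * q.getValue() * idf / lengthNorm;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new HashMap<>();
        doc.put("boat", 3);
        doc.put("river", 1);
        Map<String, Double> idf = new HashMap<>();
        idf.put("boat", 1.5);

        Map<String, Double> plain = new HashMap<>();
        plain.put("boat", 1.0); // query weight fixed at 1, as in the message
        Map<String, Double> emphasized = new HashMap<>();
        emphasized.put("boat", 2.0); // same effect as repeating the term

        System.out.println(score(doc, plain, idf, 4.0));
        System.out.println(score(doc, emphasized, idf, 4.0));
    }
}
```

Because the query weight enters as a plain multiplier, a weight of 2 on "boat" doubles that term's contribution, which is what repeating the term in a disjunctive query achieves in this simplified model. Lucene's query parser syntax also has a per-term boost operator written as boat^2, which may cover part of what is being asked for.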

We like patches and will at least consider and discuss most any patch that is well thought out, backward compatible and tested and where the author makes a good case for it.

(5) My final big question would be how Lucene really implements
ranking/scoring. We can see there are two possible strategies. Each
of them will result in different flexibility if we need to modify the
current ranking algorithms.

The first strategy is that Lucene generates document scores in a document
by document manner. In Scorer.score, we notice the framework is that each
time the scorer meets a new document, Lucene will generate a score for this
document. This framework is simple and intuitive, and all we discussed in
(4) will fit this framework. When you are processing the current document,
lengthNorm(d) can be read, so even if TW(t in d) depends on
lengthNorm(d) we can calculate it accordingly. Unfortunately, I cannot
find any low-level code that could be thought of as related to this
implementation strategy. This makes me think the Scorer.score method is not
the real place where Lucene implements its ranking. I am pretty confused
about this part. Who can help me with this?

You might take a look at how Query/Weight/Scorers are implemented. For instance, I just added a BoostingTermQuery to Java version that implements this trifecta.

The second strategy is that Lucene generates document scores in a term by
term manner. For each term inside the query, Lucene calls
termDocs.read(docs, freqs) to accumulate scores along that term’s dimension
over all the documents. Under this framework (which I feel is Lucene’s
current implementation), what we discussed in (4) does not hold. We need to
extend Lucene’s current tf(float freq) function to tf(float freq, float norm),
so that arbitrary ranking could be implemented.
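
A term-by-term accumulator with the extended tf(freq, norm) signature might look like the sketch below. Everything here is an assumption for illustration: TfWithNorm, accumulate and bm25 are hypothetical names rather than Lucene API, the norm is taken to be the raw document length, and the BM25-style saturation uses the commonly quoted parameters k1 and b as one example of a term weight that needs the document length.

```java
// Term-at-a-time accumulation sketch; names are hypothetical, not Lucene's API.
final class TermAtATimeSketch {

    /** Hypothetical extension point: a tf weight that also sees the document length. */
    interface TfWithNorm {
        double tf(float freq, float docLength);
    }

    /** Adds one term's contribution to every document in its postings. */
    static void accumulate(double[] scores, int[] docs, int[] freqs,
                           float[] docLengths, double idf, TfWithNorm tfFn) {
        for (int i = 0; i < docs.length; i++) {
            int doc = docs[i];
            scores[doc] += tfFn.tf(freqs[i], docLengths[doc]) * idf;
        }
    }

    /** BM25-style saturating tf: freq*(k1+1) / (freq + k1*(1 - b + b*len/avgLen)). */
    static TfWithNorm bm25(double k1, double b, double avgDocLength) {
        return (freq, docLength) ->
            freq * (k1 + 1) / (freq + k1 * (1 - b + b * docLength / avgDocLength));
    }

    public static void main(String[] args) {
        float[] docLengths = {10f, 10f, 10f};
        double[] scores = new double[docLengths.length];
        TfWithNorm tf = bm25(1.2, 0.75, 10.0);

        // One pass per query term over that term's postings.
        accumulate(scores, new int[] {0, 2}, new int[] {1, 1}, docLengths, 2.0, tf);
        accumulate(scores, new int[] {2}, new int[] {1}, docLengths, 0.5, tf);

        for (int doc = 0; doc < scores.length; doc++) {
            System.out.println("doc=" + doc + " score=" + scores[doc]);
        }
    }
}
```

The accumulator array is what makes this term-at-a-time: no document has a final score until every query term has been processed, so a weight that depends on document length has to be applied inside the per-term pass rather than divided out once at the end.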

There is some discussion of flexible indexing approaches on Lucene Java, which also, to me anyway, implies flexible scoring. You might search that archive for "flexible indexing" to see if anything strikes you.

(6) I have implemented a search module outside Lucene’s IndexSearcher which
can implement any arbitrary ranking over a vector query (a list of
disjunctive terms). The term-by-term implementation is much faster than my
previous document-by-document implementation. But I currently still cannot
plug the ranking module into Lucene’s Boolean engine, that is, rank only
the documents retrieved (only the documents that satisfy the condition
specified by the query).

__________________________________________________________________
I do not know whether I described my points clearly; it is complex and hard
to write in a short plain text message. Maybe I should post a PDF version
tech-report online so that the problem is stated more clearly. I am not sure
my understanding is correct. Thanks for any help and comments.


Xiangyu Jin
Department of Computer Science
University of Virginia
Charlottesville, VA 22903


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
