On Thu, Apr 7, 2011 at 7:29 PM, Marvin Humphrey <[email protected]> wrote:
> When weighting an arbitrarily complex query, we have to allow the scoring
> model the option of having member variables and methods which perform the
> weighting, and we have to allow for the possibility that it will proceed in an
> arbitrary number of stages, requiring gradual modifications to complex
> internal states before collapsing down to a final "weight" -- if it ever does.

FYI: while I am not particularly knowledgeable on Lucy, I wanted to agree with
this. We (lucene-java) seem to be heading in the same direction, so I wanted to
share a few things I have found. Obviously I am not saying any of this is the
best approach (especially not knowing Lucy's internals), but it might be some
food for thought.

1. It's really important, in my opinion, to tease the scoring out from the
matching as much as possible, because if you go the route of supporting bulk
postings compression formats (e.g. PFOR), the Matcher will become significantly
more complicated. For example, you can take a look at our "matching" for bulk
postings formats here and see how complicated it is:
http://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene/src/java/org/apache/lucene/search/TermScorer.java
At first I was terrified when Mike McCandless implemented it this way, but upon
review of a lot of literature I found reference after reference to this fact:
supporting these bulk compression formats introduces significant complexity.

2. While it's true most scoring systems have some IDF-ish concept of "weight"
that can be precomputed up front, some scoring systems cannot collapse to a
single value up front. For this reason I think it's good for this "weight" to
be opaque to everything but the similarity implementation: it might contain 2
or 3 stashed values (some DFR formulations use a "tf-itf"-like formula that
cannot be teased into a single float up front), or even contain nothing at all
(constant score?). For lucene-java though, I'm considering the idea of allowing
the opaque weight to expose a "getValue" float of some sort, basically some
estimation of the weight, which would support the query normalization of
tf/idf, for example.

3. I am not sure if Lucy supports the concept of "Explanation" (the ability to
explain the score) like Lucene does, but if so, it's important to think about
this in parallel: the splitting of matching from scoring applies to
Explanations too. In other words, in lucene-java terms, the responsibility for
explaining the score must be factored out into Similarity too.

Below is a long explanation of the currently in-progress design for lucene-java
(in case it's useful); if you want to see the work in progress, have a look at
http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/

1. What is "Weight" today in lucene-java instead has the purpose of "setting
up" the scoring:
   a. seek to the term, pass its statistics to the Similarity, and get back an
      opaque "weight"
   b. create matchers per-segment, passing the similarity and some context
      (e.g. this up-front weight and the per-segment IndexReader).

2. The matcher asks the similarity for a "Scorer". I found there are two types
needed to support all Lucene queries:
   a. ExactDocScorer: this one must implement 'float score(int docid, int tf)':
      used for termquery, exact phrasequery, etc.
   b. SloppyDocScorer: this one must implement 'float score(int docid, float
      tf)': used for sloppy phrasequery, spanqueries, etc.
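To make that concrete, here is a minimal sketch of what those two types boil
down to (simplified: I'm keeping only the score methods described above, so
the real classes in the branch may carry extra members):

  // sketch: per-segment scorer for queries that produce an exact, integer
  // term frequency (termquery, exact phrasequery, ...)
  public abstract class ExactDocScorer {
    public abstract float score(int docid, int tf);
  }

  // sketch: scorer for queries that produce a fractional "tf"
  // (sloppy phrasequery, spanqueries, ...), hence float instead of int
  public abstract class SloppyDocScorer {
    public abstract float score(int docid, float tf);
  }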
The justification behind the two is that not only are they doing different
things and might be implemented totally differently, but it also makes it
easier to optimize the exact case (e.g. TF/IDF can do its score caching; a
sketch of that trick is at the very bottom of this mail).

3. The similarity is a factory for these two types of scorers: it must
implement a 'createExactDocScorer' and a 'createSloppyDocScorer'. These
factory methods get the context from (1b), which allows them to do things like
pull the per-segment norms or whatever else they need up front.

So the idea is that the base "Similarity" only supports these 3 methods:

  Weight computeWeight(top-level context);
  ExactDocScorer createExactDocScorer(Weight weight, per-segment context);
  SloppyDocScorer createSloppyDocScorer(Weight weight, per-segment context);

But a specific model (e.g. TF/IDF) would implement this low-level, unfriendly
API and present a nice subclassable API on top of it for extension. (A rough
sketch of what such a base class might look like is at the end of this mail.)

For some examples implemented on top of this:

1. Lucene's traditional TF/IDF:
http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java
2. Dirichlet LM:
http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockLMSimilarity.java
3. BM25:
http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockBM25Similarity.java
4. DFR I(F)L2:
https://issues.apache.org/jira/secure/attachment/12474975/LUCENE-2959_mockdfr.patch

It's messy, it's a work in progress, and you can probably see some
limitations: for example, as mentioned earlier, until we make the weight
opaque, some formulations cannot yet be implemented. Additionally, the
explanations refactoring is not yet done. But you can see how the first two
systems are optimized reasonably well, and I experimented with other
optimizations (e.g. the scorer for TF/IDF is specialized to remove a
per-document "if" for the omitNorms=true/false cases).
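As promised, a rough sketch of that low-level base class. This is my hedged
approximation, not the actual code in the branch: in particular, "TermStats"
and "SegmentContext" are hypothetical placeholders for the top-level and
per-segment contexts mentioned above.

  // hypothetical placeholder types, standing in for the real contexts
  class TermStats { /* docFreq, totalTermFreq, ... */ }
  class SegmentContext { /* per-segment IndexReader, norms, ... */ }

  public abstract class Similarity {

    // opaque per-query state: only the similarity knows what is inside.
    // it might stash several floats (some DFR forms), or nothing at all.
    public static abstract class Weight {
      // optional estimate, e.g. to support tf/idf query normalization
      public float getValue() { return 1.0f; }
    }

    // called once up front with the term's top-level statistics
    public abstract Weight computeWeight(TermStats termStats);

    // per-segment factories: the context lets implementations pull
    // the per-segment norms or whatever else they need up front
    public abstract ExactDocScorer createExactDocScorer(
        Weight weight, SegmentContext context);
    public abstract SloppyDocScorer createSloppyDocScorer(
        Weight weight, SegmentContext context);
  }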

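And finally, the score-caching trick for the exact case that I mentioned
earlier. Again just a sketch (the real TFIDFSimilarity code is different, and
I'm ignoring norms here to keep it short): because the exact API passes tf as
an int, a TF/IDF scorer can precompute the scores for small tf values once,
instead of calling sqrt per document.

  // sketch only, builds on the ExactDocScorer sketch above;
  // not the actual TFIDFSimilarity code (norms omitted for brevity)
  public class CachingTFIDFDocScorer extends ExactDocScorer {
    private static final int CACHE_SIZE = 32;
    private final float[] cache = new float[CACHE_SIZE];
    private final float weightValue;

    public CachingTFIDFDocScorer(float weightValue) {
      this.weightValue = weightValue;
      // precompute sqrt(tf) * weight for small tf, once up front
      for (int tf = 0; tf < CACHE_SIZE; tf++) {
        cache[tf] = (float) Math.sqrt(tf) * weightValue;
      }
    }

    @Override
    public float score(int docid, int tf) {
      // the common case (small tf) never touches sqrt at search time
      return tf < CACHE_SIZE
          ? cache[tf]
          : (float) Math.sqrt(tf) * weightValue;
    }
  }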