Hi,

I've been working on seeing whether we can make use of impacts in Amazon search, and I have some questions. To date, we haven't used Lucene's scoring APIs at all: all of our queries are constant score, we early-terminate based on a sorted index rank, and then we re-rank using custom non-Lucene ranking models. There is now an opportunity (some early ranking models have gotten simplified) for us to move some of the ranking workload into Lucene, where we should be able to benefit from skipping hits via impacts.

I'm struggling with a typical query (not our actual setup, but it illustrates the functional gap) that is an OR-query something like:

    title:Harry_Potter_and_the_sorcerers_stone^100 (+fulltext:harry +fulltext:potter +sorcerer + stone)

Suppose there is only one document with that title, but a few dozen match all the individual terms. The one-word terms occur frequently in the fulltext field, while the title term occurs only once, yet it is a "high impact" term from the point of view of the query score. We don't index impacts for a term when docFreq < 128. This means we will never be able to skip low-scoring documents for this query, assuming that the score of the fulltext clause will always be much less than the score from the exact title match (which is by design - we always want exact title matches to rank highly).

Even when the min competitive score corresponds to a document that has each word twice, we still can't skip documents where the words only occur once, because the maximum score for the title scorer is the maximum *over the whole index*. Basically, the scorer is thinking there might be another exact title match somewhere deeper in the index *even though its postings have already been exhausted*.

I have only just started to look at the impacts code and don't have any clear idea whether this is difficult to fix, or whether I may have misconfigured something, but I thought I would ask here to see if anyone has any idea.

Things I did check:

- the query is running in TOP_SCORES mode
- the collector is calling Scorer.setMinCompetitiveScore with a low score, and subsequently collecting all matching hits even though their scores are all lower than the min
- the title impacts are represented by SlowImpactsEnum

One thing that may be relevant is that I am using a custom Query/Weight/Scorer wrapping the two clauses in order to modify their scores, because I am trying to mimic a pre-existing scoring function.
These wrappers apply a linear function with an offset, a scale and a maximum ceiling (so it can't be done just with boosts as shown above). The Scorer implements score/getMaxScore by applying its modifications to the underlying scores, setMinCompetitiveScore basically inverts that, and advanceShallow delegates to the inner Scorer. I didn't implement anything around BulkScorer - maybe that's a gap?

Any pointers appreciated!
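
For concreteness, the kind of score mapping described above can be sketched roughly as follows. This is only an illustration, not the actual code: the exact formula is not given here, so it assumes min(ceiling, offset + scale * raw) with scale > 0, and the class and method names (CappedLinearScore, apply, invert) are hypothetical. It deliberately uses no Lucene classes; apply stands in for what score/getMaxScore would do to the inner value, and invert for what setMinCompetitiveScore would pass down to the inner Scorer.

```java
// Hypothetical helper, assumed formula: mapped = min(ceiling, offset + scale * raw).
final class CappedLinearScore {
    private final float offset;
    private final float scale;
    private final float ceiling;

    CappedLinearScore(float offset, float scale, float ceiling) {
        if (!(scale > 0f)) {
            // The mapping must be increasing for the inversion below to be valid.
            throw new IllegalArgumentException("scale must be > 0");
        }
        this.offset = offset;
        this.scale = scale;
        this.ceiling = ceiling;
    }

    // Maps an inner (raw) score or max-score bound to the wrapped score.
    float apply(float raw) {
        return Math.min(ceiling, offset + scale * raw);
    }

    // Translates a minimum competitive score on the wrapped scale back to
    // the inner scale. Rounds down so the inner bound is never too strict.
    float invert(float minCompetitive) {
        if (minCompetitive > ceiling) {
            // No raw score can reach this once the ceiling is applied.
            return Float.POSITIVE_INFINITY;
        }
        float raw = Math.nextDown((minCompetitive - offset) / scale);
        // Assumes raw scores are non-negative, so 0 is the loosest bound.
        return Math.max(0f, raw);
    }
}
```

With offset 1, scale 2 and ceiling 10, a raw score of 3 maps to 7, and a minimum competitive score of 7 maps back to a value just below 3. The rounding direction matters: an inner minimum that is even slightly too high could cause competitive hits to be skipped.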