On Tue, Nov 24, 2009 at 9:18 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> I'm late to the thread, and although it looks like the discussion is
> over, I'll inline a Q for Jake.
>
> > References on how people do this *with Lucene*, or just how this is
> > done in general?  There are lots of papers on fancy things which can
> > be done, but I'm not sure where to point you to start out.  The
> > technique I'm referring to is really just the simplest possible thing
> > beyond setting your weights "by hand": let's assume you have a boolean
> > OR query, Q, built up out of sub-queries q_i (hitting, for starters,
> > different fields, although you can overlap as well with some more
> > work), each with a set of weights (boosts) b_i.  Then, if you have a
> > training corpus (good matches, bad matches, or ranked lists of matches
> > in order of relevance for the queries at hand), *and* scores (at the
> > q_i level) which are comparable,
>
> You mentioned this about 3 times in this thread (contrib/queries wants
> you!)  It sounds like you've done this before (with Lucene?).  But how,
> if the scores are not comparable, and that's required for this "field
> boost learning/training" to work?

Well, that's the point, right?  You need to make the scores comparable,
somehow.  The most general thing you can do is figure out what the
maximum possible score for a query is (if there is a maximum, which for
most scoring systems there will be, given strictly positive doc norms)
and normalize with respect to that.

For Lucene, the simplest possible way to do this (I think?) is to swap
in a true cosine (or something like Tanimoto) similarity instead of the
doc-length-normalized one (which may require externalizing the IDF).
When the score for a tf-idf weighted document and a boost-idf weighted
query (with both normalized on the same scale) is exactly the cosine of
the angle between them, scores become fairly comparable - they're all on
a 0 to 1 scale.  Longer fields/documents are still going to score way
lower than shorter documents for typical user-generated queries, but at
least now "way lower" has more meaning than before (because it's "way
lower *relative to 1.0*").

Frankly, I've done this kind of logistic regression training of weights
even with raw Lucene scoring, and while it doesn't work completely
(because the scores are so incomparable), it's remarkable how well it
ends up working - in comparison to setting your boosts by hand and
running simple A/B tests, I guess...  I'll paste rough sketches of both
the cosine scoring and the boost fitting below the quoted text.

-jake

> Thanks,
> Otis
>
> > then you can do a simple regression (linear or logistic, depending on
> > whether you map your final scores to a logit or not) on the boosts
> > b_i to fit for the best values to use.  What is critical here is that
> > scores from different queries are comparable.  If they're not, then
> > queries where the best document scores 2.0 overly affect the training
> > in comparison to the queries where the best possible score is 0.5
> > (actually, wait, it's the reverse: you're training to increase scores
> > of matching documents, so the system tries to make that 0.5-scoring
> > document score much higher by raising boosts higher and higher, while
> > the good matches already scoring 2.0 don't need any more boosting, if
> > that makes sense).
> >
> > There are of course far more complex "state of the art" training
> > techniques, but someone like Ted would probably be able to give a
> > better list of references on where it's easiest to read about those.
> > But I can try to dredge up some places where I've read about doing
> > this, and post again later if I can find any.
> >
> > -jake
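To make the "true cosine" idea concrete, here's a rough, untested sketch
of the scoring itself, outside of Lucene's Similarity machinery (the class
and method names are made up for illustration, and the idf values are
assumed to come from wherever you've externalized them): the document
vector is tf*idf weighted, the query vector is boost*idf weighted, and the
score is just the cosine of the angle between them, so everything lands on
a 0..1 scale.

import java.util.Map;

public class CosineScorer {

  /**
   * docTf:      term -> raw term frequency in the field
   * queryBoost: term -> query-side boost
   * idf:        term -> inverse document frequency (externalized)
   */
  public static double cosineScore(Map<String, Double> docTf,
                                   Map<String, Double> queryBoost,
                                   Map<String, Double> idf) {
    double dot = 0.0, docNormSq = 0.0, queryNormSq = 0.0;

    // Document vector: tf * idf per term; accumulate its squared norm.
    for (Map.Entry<String, Double> e : docTf.entrySet()) {
      Double termIdf = idf.get(e.getKey());
      if (termIdf == null) continue;
      double dw = e.getValue() * termIdf;
      docNormSq += dw * dw;
    }

    // Query vector: boost * idf per term; accumulate the dot product with
    // the document vector and the query vector's squared norm.
    for (Map.Entry<String, Double> e : queryBoost.entrySet()) {
      Double termIdf = idf.get(e.getKey());
      if (termIdf == null) continue;
      double qw = e.getValue() * termIdf;
      queryNormSq += qw * qw;
      Double tf = docTf.get(e.getKey());
      if (tf != null) {
        dot += qw * (tf * termIdf);
      }
    }

    if (docNormSq == 0.0 || queryNormSq == 0.0) return 0.0;
    // Cosine of the angle between the two vectors, always in [0, 1] here
    // since all the weights are non-negative.
    return dot / (Math.sqrt(docNormSq) * Math.sqrt(queryNormSq));
  }
}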
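And here's a similarly rough sketch (again untested, names made up) of the
boost fitting: each training example is one (query, doc) pair carrying its
per-field sub-query scores - already normalized so they're comparable
across queries - and a 0/1 relevance label.  It's plain batch gradient
ascent on the logistic likelihood; the learned weights are the boosts b_i
you'd plug back into the sub-queries of the BooleanQuery.

import java.util.List;

public class BoostTrainer {

  /** One (query, doc) training example. */
  public static class Example {
    public final double[] fieldScores;  // s_i, one per sub-query / field
    public final int relevant;          // 1 = good match, 0 = bad match
    public Example(double[] fieldScores, int relevant) {
      this.fieldScores = fieldScores;
      this.relevant = relevant;
    }
  }

  public static double[] fitBoosts(List<Example> examples, int numFields,
                                   double learningRate, int iterations) {
    double[] boosts = new double[numFields];          // start at 0.0
    for (int iter = 0; iter < iterations; iter++) {
      double[] gradient = new double[numFields];
      for (Example ex : examples) {
        // Predicted P(relevant) = sigmoid(sum_i b_i * s_i).
        double z = 0.0;
        for (int i = 0; i < numFields; i++) {
          z += boosts[i] * ex.fieldScores[i];
        }
        double predicted = 1.0 / (1.0 + Math.exp(-z));
        double error = ex.relevant - predicted;       // residual
        for (int i = 0; i < numFields; i++) {
          gradient[i] += error * ex.fieldScores[i];
        }
      }
      // Average gradient step over the training set.
      for (int i = 0; i < numFields; i++) {
        boosts[i] += learningRate * gradient[i] / examples.size();
      }
    }
    return boosts;
  }
}

In real life you'd probably want a bias term, some regularization, and
graded relevance labels rather than 0/1, but that's the core of it.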