I'm late to the thread, and although it looks like the discussion is over, I'll inline a Q for Jake.
>I should add in my $0.02 on whether to just get rid of queryNorm() altogether:
>>>
>>> -1 from me, even though it's confusing, because having that call there
>>> (somewhere, at least) allows you to actually do compare scores across
>>> queries if you do the extra work of properly normalizing the documents as
>>> well (at index time).
>>
>>
>>Do you have some references on this? I'm interested in reading more on the
>>subject. I've never quite been sold on how it is meaningful to compare
>>scores and would like to read more opinions.
>
>References on how people do this *with Lucene*, or just how this is done in
>general? There are lots of papers on fancy things which can be done, but I'm
>not sure where to point you to start out. The technique I'm referring to is
>really just the simplest possible thing beyond setting your weights "by hand":
>let's assume you have a boolean OR query, Q, built up out of sub-queries q_i
>(hitting, for starters, different fields, although you can overlap as well
>with some more work), each with a set of weights (boosts) b_i, then if you
>have a training corpus (good matches, bad matches, or ranked lists of matches
>in order of relevance for the queries at hand), *and* scores (at the q_i
>level) which are comparable,

You mentioned this about 3 times in this thread (contrib/queries wants you!)
It sounds like you've done this before (with Lucene?). But how, if the scores
are not comparable, and that's required for this "field boost
learning/training" to work?

Thanks,
Otis

> then you can do a simple regression (linear or logistic, depending on whether
> you map your final scores to a logit or not) on the w_i to fit for the best
> boosts to use. What is critical here is that scores from different queries
> are comparable.
> If they're not, then queries where the best document for a
> query scores 2.0 overly affect the training in comparison to the queries
> where the best possible score is 0.5 (actually, wait, it's the reverse:
> you're training to increase scores of matching documents, so the system tries
> to make that 0.5 scoring document score much higher by raising boosts higher
> and higher, while the good matches already scoring 2.0 don't need any more
> boosting, if that makes sense).
>
>There are of course far more complex "state of the art" training techniques,
>but probably someone like Ted would be able to give a better list of
>references on where is easiest to read those from. But I can try to dredge up
>some places where I've read about doing this, and post again later if I can
>find any.
>
> -jake
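
For anyone skimming the thread, here is a rough sketch of the kind of boost fitting Jake describes: a plain logistic regression with one weight per sub-query q_i. It is not Lucene code and not necessarily how Jake did it. The function name `fit_boosts`, the two "fields" (title, body), and all of the numbers are made up for illustration, and the sketch simply assumes the per-sub-query scores are already comparable across queries, which is exactly the point under discussion. Jake writes b_i and later w_i; the sketch treats both as the same per-sub-query boost.

```python
import math

def fit_boosts(features, labels, lr=0.5, epochs=2000):
    """Fit one boost per sub-query with batch-gradient logistic regression.

    features: one row per (query, document) pair; row[i] is the score of
              sub-query q_i for that pair (ASSUMED comparable across queries).
    labels:   1 for a good match, 0 for a bad match.
    Returns (boosts, bias).
    """
    n = len(features[0])
    m = len(features)
    boosts = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        grad_boosts = [0.0] * n
        grad_bias = 0.0
        for row, y in zip(features, labels):
            # Combined score of the OR query: weighted sum of sub-query scores.
            z = bias + sum(b * s for b, s in zip(boosts, row))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of relevance
            err = p - y
            for i, s in enumerate(row):
                grad_boosts[i] += err * s
            grad_bias += err
        # One gradient-descent step on the average log-loss.
        boosts = [b - lr * g / m for b, g in zip(boosts, grad_boosts)]
        bias -= lr * grad_bias / m
    return boosts, bias

# Hypothetical training data: [title_score, body_score] per (query, doc) pair.
features = [
    [0.9, 0.2], [0.8, 0.7], [0.7, 0.1],   # judged good matches
    [0.1, 0.8], [0.2, 0.3], [0.1, 0.1],   # judged bad matches
]
labels = [1, 1, 1, 0, 0, 0]

boosts, bias = fit_boosts(features, labels)
print("fitted boosts (title, body):", boosts)
```

In this toy data the title score is what separates good from bad matches, so the fit should end up weighting the title sub-query more heavily than the body one. If scores from different queries lived on different scales (2.0 vs. 0.5 for the best hit), the rows would not be comparable and the fitted boosts would be skewed the way Jake describes.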