I'm late to the thread, and although it looks like the discussion is over, I'll 
inline a Q for Jake.

>>>I should add in my $0.02 on whether to just get rid of queryNorm() altogether:
>>>
>>>  -1 from me, even though it's confusing, because having that call there 
>>> (somewhere, at least) allows you to actually do compare scores across 
>>> queries if you do the extra work of properly normalizing the documents as 
>>> well (at index time).
>>
>>
>>Do you have some references on this?  I'm interested in reading more on the 
>>subject.  I've never quite been sold on how it is meaningful to compare 
>>scores and would like to read more opinions.
> 
>References on how people do this *with Lucene*, or just how this is done in 
>general?  There are lots of papers on fancy things which can be done, but I'm 
>not sure where to point you to start out.  The technique I'm referring to is 
>really just the simplest possible thing beyond setting your weights "by hand": 
>let's assume you have a boolean OR query, Q, built up out of sub-queries q_i 
>(hitting, for starters, different fields, although you can overlap as well 
>with some more work), each with its own weight (boost) b_i, then if you 
>have a training corpus (good matches, bad matches, or ranked lists of matches 
>in order of relevance for the queries at hand), *and* scores (at the q_i 
>level) which are comparable,

You've mentioned this about 3 times in this thread (contrib/queries wants you!).
It sounds like you've done this before (with Lucene?).  But how does it work if 
the scores are not comparable, given that comparability is required for this 
"field boost learning/training" to work?

Thanks,
Otis

> then you can do a simple regression (linear or logistic, depending on whether 
> you map your final scores to a logit or not) on the b_i to fit for the best 
> boosts to use.  What is critical here is that scores from different queries 
> are comparable.  If they're not, then queries where the best document for a 
> query scores 2.0 overly affect the training in comparison to the queries 
> where the best possible score is 0.5 (actually, wait, it's the reverse: 
> you're training to increase scores of matching documents, so the system tries 
> to make that 0.5 scoring document score much higher by raising boosts higher 
> and higher, while the good matches already scoring 2.0 don't need any more 
> boosting, if that makes sense).
>
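
To make the recipe above concrete, here is a rough sketch of the fit as I
understand it -- my own toy code, not Jake's, and the names (BoostTrainer,
fitBoosts) and the numbers are made up for illustration.  It assumes you've
already collected, for each labeled (query, document) pair in your training
corpus, the per-sub-query scores s_i, and it fits the boosts b_i by plain
logistic regression:

import java.util.Arrays;

public class BoostTrainer {

    /**
     * scores[n][i] = score of sub-query q_i for training example n
     * labels[n]    = 1.0 for a relevant (query, doc) pair, 0.0 otherwise
     * Returns one fitted boost b_i per sub-query.
     */
    public static double[] fitBoosts(double[][] scores, double[] labels,
                                     int epochs, double lr) {
        int numBoosts = scores[0].length;
        double[] b = new double[numBoosts];          // start with all boosts at 0
        for (int epoch = 0; epoch < epochs; epoch++) {
            double[] grad = new double[numBoosts];
            for (int n = 0; n < scores.length; n++) {
                // predicted relevance = sigmoid(sum_i b_i * s_i)
                double z = 0.0;
                for (int i = 0; i < numBoosts; i++) {
                    z += b[i] * scores[n][i];
                }
                double p = 1.0 / (1.0 + Math.exp(-z));
                // gradient of the log-loss with respect to each boost
                for (int i = 0; i < numBoosts; i++) {
                    grad[i] += (p - labels[n]) * scores[n][i];
                }
            }
            for (int i = 0; i < numBoosts; i++) {
                b[i] -= lr * grad[i] / scores.length;
            }
        }
        return b;
    }

    public static void main(String[] args) {
        // Toy data: two sub-queries (say, a title query and a body query).
        // The relevant pairs happen to score high on the first sub-query.
        double[][] scores = {
            {0.9, 0.5},   // relevant
            {0.8, 0.2},   // relevant
            {0.2, 0.6},   // not relevant
            {0.1, 0.3},   // not relevant
        };
        double[] labels = {1, 1, 0, 0};
        double[] boosts = fitBoosts(scores, labels, 2000, 0.5);
        System.out.println("learned boosts: " + Arrays.toString(boosts));
    }
}

The fitted b_i then become the boosts on the corresponding sub-queries of the
boolean OR query.  This is exactly where the comparability requirement bites:
the s_i have to be on a common scale across queries, or the fit gets skewed in
the way described above.
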
>There are of course far more complex "state of the art" training techniques, 
>but someone like Ted could probably give a better list of references on where 
>it's easiest to read about them.  But I can try to dredge up 
>some places where I've read about doing this, and post again later if I can 
>find any.
>
>  -jake
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
