On Sep 3, 2008, at 3:00 PM, Michael McCandless wrote:
Obviously we can't default everything perfectly since at some point there are hard tradeoffs to be made and every app is different, but if SweetSpotSimilarity really gives better relevance for many/most apps, and doesn't have any downsides (I haven't looked closely myself), I think we should get it into core?
Well, we only have two data points here: Hoss' original position that it was helpful, and Doron's Million Query work. Has anyone else reported a benefit? In Doron's numbers, the difference between OOTB and SweetSpot was 0.154 vs. 0.162 for MAP. Not a huge amount, but still useful. There are also other length normalization functions (namely, approaches that don't favor very short documents as much) that I've seen benefit applications, but as Erik is (in)famous for saying, "it depends". In fact, if we go solely by the Million Query work, we'd be better off having the Query Parser automatically create phrase queries for any query w/ more than 1 term (0.19 vs. 0.154) before we even touch length normalization.
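To make the length normalization point concrete, here is a rough sketch (in Python, not Lucene code) comparing the two shapes being discussed. It assumes the default norm is 1/sqrt(numTerms) and that the sweet-spot norm is flat between a min and max length and decays outside that range, which is my recollection of the SweetSpotSimilarity javadoc. The function names, the min/max values in the usage note, and the steepness default are all hypothetical choices for illustration.

```python
import math


def default_length_norm(num_terms: int) -> float:
    """Classic norm: shorter fields always score higher (assumes num_terms >= 1)."""
    return 1.0 / math.sqrt(num_terms)


def sweet_spot_length_norm(num_terms: int, min_len: int, max_len: int,
                           steepness: float = 0.5) -> float:
    """Plateau-style norm (assumed formula): 1.0 for any length inside
    [min_len, max_len], decaying as the length moves outside that range."""
    # Distance is 0 inside the plateau and grows linearly outside it.
    distance = abs(num_terms - min_len) + abs(num_terms - max_len) - (max_len - min_len)
    return 1.0 / math.sqrt(steepness * distance + 1.0)
```

With a hypothetical sweet spot of 10 to 100 terms, a 10-term field and a 50-term field both get a norm of 1.0, whereas the default norm would rank the 10-term field well above the 50-term one. Whether that helps still depends on the collection.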
I've long argued that Lucene needs to take on the relevance question more head on, and in an open source way. Until then, we are merely guessing at what's better, w/o empirical evidence that can be easily reproduced. TREC is just one data point, and is often discounted as not being all that useful in the real world.
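For anyone who wants to reproduce this kind of comparison on their own data, the MAP figures quoted above boil down to something like the following sketch. This is the plain textbook definition of (mean) average precision over ranked results and relevance judgments; the Million Query numbers may have been produced with a sampled estimator rather than this exact computation, and the function names here are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at each rank k where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant docs that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: non-empty list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Running the same query set and judgments against two Similarity configurations and comparing the two MAP values is the kind of easily repeated experiment I have in mind.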
I'm on the fence, though. I agree w/ Hoss that core should be "core" and I don't think we want to throw more and more into core, but I also agree w/ Mike in that we want good, intelligent defaults for what we do have in core.
-Grant