Hi, I'm trying to figure out what I need to do with Lucene to score a document higher when it matches a larger number of unique search terms, rather than scoring on term frequency counts.
A quick example. If I'm searching for "BIRD CAT DOG" (all should clauses), then I want

  ...a document with BIRD, CAT, and DOG terms, each appearing only once, to score higher than
  ...a document with BIRD, CAT, CAT, CAT, CAT, CAT, CAT, CAT.

The rationale behind this is that if something "fits" my query better by hitting more terms, I don't want it to be drowned out by a document that simply mentions a subset of the keywords many times.

And, the tricky part: ideally I'd like to be able to switch between the two schemes, so the user can get documents scored either way.

So far I've been reading the 'score and frequency' thread at http://www.gossamer-threads.com/lists/lucene/java-user/8916, where Niraj seems to have a similar problem. He tries things like overriding term frequencies, tf(), and setting the default similarity. Unfortunately, it isn't long before the reply chain is 18 layers deep (I counted), and it never becomes clear whether a solution was reached, so I wasn't certain if I was on the right research path or not.

It started to appear that some of the scoring might be done at index time, but that didn't make sense to me, since weights and such can be applied at query time.

Is there any way to have Lucene score based on the discrete number of unique terms found, rather than how often a given term appears in a document?

Thanks,
-wls

ps. When replying to this, it'd be great if content not pertinent to the reply were trimmed in the response. I don't want to cause a similar message snowball to roll down the hill, picking up angle brackets along the way.
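pps. To make the two schemes concrete, here is a toy sketch in plain Java of the behavior I'm after. It is not Lucene code: the class and method names (ScoringSketch, uniqueTermScore, rawFrequencyScore, score) are made up for illustration, and real Lucene scoring involves more factors than the raw counts used here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Lucene.
public class ScoringSketch {

    // Scheme A: one point per distinct query term that occurs in the document,
    // no matter how many times it occurs.
    static int uniqueTermScore(List<String> queryTerms, List<String> docTerms) {
        Set<String> matched = new HashSet<>(docTerms);
        matched.retainAll(new HashSet<>(queryTerms));
        return matched.size();
    }

    // Scheme B: one point per occurrence of any query term in the document.
    static int rawFrequencyScore(List<String> queryTerms, List<String> docTerms) {
        Set<String> query = new HashSet<>(queryTerms);
        int score = 0;
        for (String term : docTerms) {
            if (query.contains(term)) {
                score++;
            }
        }
        return score;
    }

    // The switch between the two schemes that the user would control.
    static int score(List<String> queryTerms, List<String> docTerms, boolean preferUniqueTerms) {
        return preferUniqueTerms
                ? uniqueTermScore(queryTerms, docTerms)
                : rawFrequencyScore(queryTerms, docTerms);
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("BIRD", "CAT", "DOG");
        List<String> docA = Arrays.asList("BIRD", "CAT", "DOG");
        List<String> docB = Arrays.asList("BIRD", "CAT", "CAT", "CAT", "CAT", "CAT", "CAT", "CAT");

        System.out.println("unique-term scheme: A=" + score(query, docA, true)
                + " B=" + score(query, docB, true));
        System.out.println("frequency scheme:   A=" + score(query, docA, false)
                + " B=" + score(query, docB, false));
    }
}
```

With the unique-term scheme, document A (3 distinct hits) outranks document B (2 distinct hits); with the frequency scheme the order flips, since B has 8 occurrences against A's 3. That flip is exactly what I'd like the user to be able to toggle.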