Scoring on Number of Unique Terms Hit, Not Term Frequency Counts

Walt Stoneburner Thu, 24 May 2007 08:22:49 -0700

Hi,

 I'm trying to figure out what I need to do with Lucene to score a
document higher when it hits a larger number of unique search terms,
rather than when it has higher term frequency counts.


 A quick example.

 If I'm searching for "BIRD CAT DOG" (all should clauses), then I want

  ...a document containing the terms BIRD, CAT, and DOG, each appearing
only once, to score higher than

  ...a document with BIRD, CAT, CAT, CAT, CAT, CAT, CAT, CAT.

 The rationale behind this is that if something "fits" my query
better by hitting more terms, I don't want it to be drowned out by a
document that simply mentions a subset of keywords a lot of times.

 And, the tricky part: ideally I'd like to be able to switch between
the two schemes, so the user can get documents scored either way.
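
 To make the two schemes concrete, here is a minimal sketch in plain
Java. It is only an illustration of the ranking I'm after and does not
use any Lucene API; the class, enum, and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; this does not use any Lucene API.
// All names here (TermHitScorer, Mode, score) are hypothetical.
final class TermHitScorer {

    // Hypothetical switch between the two ranking schemes.
    enum Mode { UNIQUE_TERMS, TERM_FREQUENCY }

    // Scores a document (given as its list of terms) against the query terms.
    static int score(List<String> queryTerms, List<String> docTerms, Mode mode) {
        Set<String> query = new HashSet<>(queryTerms);
        if (mode == Mode.UNIQUE_TERMS) {
            // Count each query term at most once, however often it occurs.
            Set<String> hits = new HashSet<>(docTerms);
            hits.retainAll(query);
            return hits.size();
        }
        // Count every occurrence of any query term.
        int total = 0;
        for (String term : docTerms) {
            if (query.contains(term)) {
                total++;
            }
        }
        return total;
    }
}
```

 For the query BIRD CAT DOG, the first document in my example scores 3
and the second scores 2 under UNIQUE_TERMS, while under TERM_FREQUENCY
they score 3 and 8, which is the ordering I want to avoid by default.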


 So far I've been reading the 'score and frequency' thread at
http://www.gossamer-threads.com/lists/lucene/java-user/8916, where
Niraj seems to have a similar problem.  He tries things such as
overriding the term frequency function, tf(), and setting the default
similarity.

 Unfortunately, it isn't long before the reply chain is 18 layers
deep (I counted), and it never becomes clear whether a solution was
reached, so I wasn't certain whether I was on the right research path.
It started to appear that some of the scoring might be done at index
time, but that didn't make sense to me, since weights and such can be
applied at query time.

 Is there any way to have Lucene score based on the discrete number
of unique terms found, rather than how often a given term appears in a
document?

Thanks,
-wls
ps.  When replying to this, it'd be great if content not pertinent to
the reply were trimmed from the response.  I don't want to cause a
similar message snowball to roll down the hill, picking up angle
brackets along the way.

