Re: Scoring exact matches higher in a stemmed field

Itamar Syn-Hershko Thu, 22 Jul 2010 14:45:19 -0700

On 22/7/2010 9:20 PM, Shai Erera wrote:

> How is that different than extending QP?

Mainly because the problem I'm having isn't there, and doing it from there doesn't feel right, and definitely not like solving the issue. I want to explore what other options there are before doing anything, and I started this thread because I hit a dead end after seeing that Similarity can no longer be of help.

> About the "song of songs" example -- the result you describe is already what
> will happen. A document which contains just the word 'song' will score lower
> than a document containing "song of songs".

Incorrect, and I have a sample app to show that (this is how I thought of this example in the first place).

While indexing, the 2 words will be saved into the index as 1:(song song$) 2:(song songs$), so short documents with the one word "song" will score higher than longer documents with "song of songs". This is a product of Lucene's default tf/idf implementation, which cares about a field's length, and at this stage I want to avoid replacing it (with BM25, for example).
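
The length effect described above can be illustrated with a little arithmetic. The following is a plain-Python sketch, not Lucene code: it assumes the classic default formulas (tf = sqrt(freq), lengthNorm = 1/sqrt(number of tokens)), assumes that every word is indexed as a stem plus a $-marked original, and assumes "of" is not removed as a stop word. It ignores idf, query normalization, coord, and the lossy one-byte norm encoding, so the numbers only show the direction of the effect for the single term "song".

```python
import math

def tf(freq):
    # Classic tf factor: square root of the term frequency.
    return math.sqrt(freq)

def length_norm(num_tokens):
    # Classic length norm: 1 / sqrt(number of tokens in the field).
    return 1.0 / math.sqrt(num_tokens)

def term_weight(tokens, term):
    # tf * lengthNorm for one term. idf is identical for every document,
    # so it is left out of this comparison.
    freq = tokens.count(term)
    return tf(freq) * length_norm(len(tokens))

# Assumed analyzer output: each word yields its stem plus a $-marked original.
short_doc = ["song", "song$"]                                  # "song"
long_doc = ["song", "song$", "of", "of$", "song", "songs$"]    # "song of songs"

short_w = term_weight(short_doc, "song")
long_w = term_weight(long_doc, "song")
print(f"short: {short_w:.4f}  long: {long_w:.4f}")
```

Under these assumptions the short document gets the larger weight for "song", because the extra tokens in the longer field shrink its norm faster than the second occurrence raises its tf.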

> Also, what I'd do in such a case
> is search for the phrase (in addition to the rest), 'cause documents
> containing the word "songs" 100 times will score higher than the single
> document that will contain "song of songs" once ...

In one of my applications I am providing an "as typed" capability, which does exactly what you are suggesting (looking for the $-ed terms only), but I want my original analyzer (the one that also looks for non $-ed terms) to do better scoring. Without this the implementation is somewhat broken...

> If you just want a query "abc def" to rank higher if a document contains the
> exact words, then I'd go w/ the QP extension approach, or do other
> sophistication like searching for 'abc' '"abc"' etc. or something like
> that. There are many tricks you can do on your end, w/o overriding much in
> Lucene. Still, IMO extending QP is the easiest and gives you the control you
> need.

I am overriding stuff in Lucene either way. I also don't want an exact match of a phrase to rank higher; I want an original term (saved as-is with a $ marker) to score higher than a stemmed / lemmatized one (without the marker). Sorry if the thread's title is misleading.
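
To make the goal concrete, here is a rough sketch of what "original term scores higher than the stemmed one" could look like if expressed purely as a query string with boosts. It is only an illustration, not the approach settled on in this thread: `toy_stem` is a hypothetical stand-in for a real stemmer or lemmatizer, the boost of 2.0 is an arbitrary assumption, and the output relies only on Lucene's standard `term^boost` query syntax.

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 1 else word

def expand_term(word, exact_boost=2.0):
    # Emit the $-marked original with a boost, plus the unboosted stem.
    stem = toy_stem(word)
    return f"({word}$^{exact_boost} {stem})"

def expand_query(text, exact_boost=2.0):
    # Expand every whitespace-separated word of the user's query.
    return " ".join(expand_term(w, exact_boost) for w in text.lower().split())

query = expand_query("song of songs")
print(query)
```

A document containing the word exactly as typed matches both clauses of a group, while one that matches only through stemming matches the unboosted clause alone.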

I'd have used payloads if they weren't costly. So my question is: where do I have control over boosting (or scoring), and also have access to the term's text?


Thanks,

Itamar.
