On 22/7/2010 9:20 PM, Shai Erera wrote:
How is that different than extending QP?
Mainly because the problem I'm having isn't there, and doing it from there doesn't feel right, and definitely not like solving the issue. I want to explore what other options there are before doing anything, and I started this thread because I hit a dead end after seeing Similarity can no more be of help.

About the "song of songs" example -- the result you describe is already what
will happen. A document which contains just the word 'song' will score lower
than a document containing "song of songs".
Incorrect, and I have a sample app to show that (this is how I thought of this example for the first place).

Since while indexing the 2 words will be saved into index as 1:(song song$) 2:(song songs$), short documents with one word "song" will score higher than longer documents with "song of songs". This is a product of Lucene's default tf/idf implementation which cares about a field's length, and at this stage I want to avoid replacing it (with BM25 for example).

Also, what I'd do in such a case
is search for the phrase (in addition to the rest), 'cause documents
containing the word "songs" 100 times will score higher than the single
document that will contain "song of songs" once ...
In one of my applications I am providing an "as typed" capability, which does exactly what you are suggesting (looking for the $-ed terms only), but I want my original analyzer (the one that also looks for non $-ed terms) to do better scoring. Without this the implementation is somewhat broken...

If you just want a query "abc def" to rank higher if a document contains the
exact words, then I'd go w/ the QP extension approach, or do other
sophistication like searching for 'abc' '\"abc\"' etc. or something like
that. There are many tricks you can do on your end, w/o overriding much in
Lucene. Still, IMO extending QP is the easiest and gives you the control you
need.
I am overriding stuff in Lucene either way. I also don't want an exact match of a phrase to rank higher; I want an original term (saved as-is with a $ marker) to score higher than a stemmed / lemmatized one (without the marker). Sorry if the thread's title is misleading.

I'd have used payloads if it wasn't costly. So my question is: where do I have control over boosting (or scoring), and also have access to the term's text?

Thanks,

Itamar.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to