Re: Scoring exact matches higher in a stemmed field

Itamar Syn-Hershko Mon, 19 Jul 2010 11:26:33 -0700

On 19/7/2010 5:50 PM, Shai Erera wrote:

If your analyzer outputs b and b$ in the same position, then the below query
will already be what the QP outputs today. If you want to incorporate
boosting, I can suggest that you extend QP, override newTermQuery for
example, and if the term is a stemmed term, then set the query's boost
(Query.setBoost) accordingly. Would that work for you?

I want to avoid overriding the QP, and do this as a pluggable extension. What other options do I have other than what you've suggested?

Ideally, that would be through a class or a function I can override or extend, so each term hit while searching will be examined. By checking its type and text (for suffix), that interface could double its weight (or boost). The similarity functions I mentioned could have provided this ability (see below). How can this be done without them?

You'll need to check whether you want to boost terms inside phrases, or
entire phrases, and then override more methods from QP. But that approach
will get you the native product of the engine, I think.

Just to make sure we are on the same page here, here's an example (assuming the default tf/idf implementation in Lucene).

I want to make sure anyone searching for "song of songs" will find texts discussing the biblical book, and have them ranked the highest, instead of having short texts containing one word "song" score higher.

So what I do is have my stemming analyzer save the string "song of songs" like this, where each parenthesis represents a token position: (song song$) (song songs$).
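To illustrate that token layout, here is a minimal sketch in plain Java. It is not my actual analyzer and does not use Lucene's TokenStream API; the class name, the naive trailing-"s" stemmer, and the stop-word list (which is why "of" produces no position) are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a stemming analyzer; illustration only.
final class ExactTokenSketch {
    // Assumed stop words; "of" is dropped, matching the two positions above.
    private static final Set<String> STOP_WORDS = Set.of("of", "the", "a");

    // Naive stemmer (assumption): strips one trailing "s" from longer words.
    static String stem(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Returns one inner list per token position: the stem, then the
    // exact surface form marked with a "$" suffix.
    static List<List<String>> analyze(String text) {
        List<List<String>> positions = new ArrayList<>();
        for (String raw : text.toLowerCase().split("\\s+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            positions.add(List.of(stem(raw), raw + "$"));
        }
        return positions;
    }
}
```

Calling ExactTokenSketch.analyze("song of songs") yields two positions, the first holding song and song$, the second holding song and songs$.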

The part I'm missing is how to score terms with suffixes higher. The best approach seems to be looking at the term read by IndexReader and boosting that hit somehow. The assumption is that if IndexReader has read the term songs$, it was looked for, and therefore it is the exact word that was queried.
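The effect I am after can be sketched outside of Lucene roughly as follows. This is only a toy model under stated assumptions: it ignores tf/idf and length normalization entirely, the class and method names are hypothetical, and the factor of 2 is an arbitrary choice for "double its weight".

```java
import java.util.List;
import java.util.Set;

// Toy scoring model (not Lucene code): each matched query term
// contributes its boost, and "$"-suffixed terms get a higher boost.
final class ExactBoostSketch {
    // Assumed boost for exact (suffixed) terms.
    static final float EXACT_BOOST = 2.0f;

    // A term ending in "$" marks the exact surface form.
    static float boostFor(String termText) {
        return termText.endsWith("$") ? EXACT_BOOST : 1.0f;
    }

    // Sums the boosts of the query terms that occur in the document.
    static float score(List<String> queryTerms, Set<String> docTerms) {
        float total = 0.0f;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                total += boostFor(term);
            }
        }
        return total;
    }
}
```

With the query terms song and songs$, a document indexed with song and songs$ gets both contributions, while a document that only contains the word "song" (indexed as song and song$) gets only the unboosted stem match.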


Which is the best Lucene part to hijack for this mission?

Alternatively, you
can set a payload on the stemmed terms and incorporate that into Similarity,
but that's more costly.

I had mentioned Payloads - this will get me exactly what I want, but as you say they are quite costly when used for almost every term in the index. If I could replace the suffix with Payloads I would have done this (byte vs. byte), but I'm using the suffix for one other thing.

I don't follow what's been deprecated on Sim that you cannot use anymore?
All I see are 3 deprecated static methods which are related to norms ...

In 2.3.2 there were these functions:

    public float idf(Term term, Searcher searcher)

    public float idf(Collection terms, Searcher searcher)

These have been deprecated somewhere between that version and 2.9.2, and it seems like I could have used those for what I'm trying to do.


Thanks,

Itamar.

