On 19/7/2010 5:50 PM, Shai Erera wrote:
> If your analyzer outputs b and b$ in the same position, then the below query
> will already be what the QP outputs today. If you want to incorporate
> boosting, I can suggest that you extend QP, override newTermQuery for
> example, and if the term is a stemmed term, then set the query's boost
> (Query.setBoost) accordingly. Would that work for you?
I want to avoid overriding the QP, and do this as a pluggable extension.
What other options do I have other than what you've suggested?
Ideally, that would be through a class or a function I can override or
extend, so each term hit while searching will be examined. By checking
its type and text (for suffix), that interface could double its weight
(or boost). The similarity functions I mentioned could have provided
this ability (see below). How can this be done without them?
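A minimal sketch of the check described above, written without any Lucene
dependency so it stands on its own. The class name SuffixBoost, the method
boostFor, and the factor 2.0 are assumptions made for illustration; only the
"$" marker and the idea of doubling the weight come from this thread. The
result could be handed to Query.setBoost from a newTermQuery override, or
used from whatever hook ends up being available.

```java
/** Hypothetical helper: picks a boost for a term by inspecting its text. */
final class SuffixBoost {
    /** Marker the analyzer appends to the exact (unstemmed) form. */
    static final char EXACT_MARKER = '$';
    /** Assumed factor; the thread only says "double its weight". */
    static final float EXACT_BOOST = 2.0f;

    private SuffixBoost() {}

    /** Returns EXACT_BOOST for terms ending in the marker, 1.0 otherwise. */
    static float boostFor(String termText) {
        boolean exact = !termText.isEmpty()
                && termText.charAt(termText.length() - 1) == EXACT_MARKER;
        return exact ? EXACT_BOOST : 1.0f;
    }
}
```

With this rule, boostFor("songs$") yields the doubled weight while
boostFor("song") leaves the stemmed term at the neutral boost of 1.0.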
> You'll need to check whether you want to boost terms inside phrases, or
> entire phrases, and then override more methods from QP. But that approach
> will get you the native product of the engine, I think.
Just to make sure we are on the same page here, here's an example
(assuming the default tf/idf implementation in Lucene).
I want to make sure anyone searching for "song of songs" will find texts
discussing the biblical book, and have them ranked the highest, instead
of having short texts containing one word "song" score higher.
So what I do is have my stemming analyzer save the string "song of
songs" like this, where each parenthesis represents a token position:
(song song$) (song songs$).
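To make the token layout concrete, here is a toy model of that analysis step
in plain Java. It is not a Lucene Analyzer: the class ExactFormTokens, the
one-rule stemmer (strip a trailing "s"), and the stop-word set containing only
"of" are all assumptions for illustration. It also assumes the dropped stop
word leaves no position gap, which matches the two groups shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model: emits the stem and the "$"-marked exact form per position. */
final class ExactFormTokens {
    /** Assumed stop-word list; a real analyzer would use a fuller one. */
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("of"));

    private ExactFormTokens() {}

    /** Toy stemmer: strips one trailing "s". Stand-in for a real stemmer. */
    static String stem(String word) {
        return word.length() > 1 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    /** Returns one inner list per token position: [stem, exactForm + "$"]. */
    static List<List<String>> analyze(String text) {
        List<List<String>> positions = new ArrayList<List<String>>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue; // skip blanks and stop words
            }
            positions.add(Arrays.asList(stem(word), word + "$"));
        }
        return positions;
    }
}
```

Calling ExactFormTokens.analyze("song of songs") gives two positions, the
first holding song and song$, the second holding song and songs$.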
The part I'm missing is how to score terms with suffixes higher. The
best approach seems to be to look at the term read by IndexReader and
boost that hit somehow. The assumption is that if IndexReader has read
the term songs$, it was looked for, and therefore it is the exact
word that was queried.
Which is the best Lucene part to hijack for this mission?
> Alternatively, you can set a payload on the stemmed terms and incorporate
> that into Similarity, but that's more costly.
I had mentioned Payloads: they would get me exactly what I want, but as
you say they are quite costly when used for almost every term in the index.
If I could replace the suffix with Payloads I would have done so (byte
vs. byte), but I'm using the suffix for one other thing.
> I don't follow what's been deprecated on Sim that you cannot use anymore.
> All I see are 3 deprecated static methods which are related to norms ...
In 2.3.2 there were these functions:
public float idf(Term term, Searcher searcher)
public float idf(Collection terms, Searcher searcher)
These have been deprecated somewhere between that version and 2.9.2, and
it seems like I could have used those for what I'm trying to do.
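For reference, a rough model of what a term-aware idf could look like. It is
plain Java, not a Similarity subclass: the class TermAwareIdf and the factor
2.0 are assumptions, and docFreq and numDocs are passed in directly rather
than read from a Searcher. The base formula is the one DefaultSimilarity uses
for idf(int docFreq, int numDocs). The 2.9 javadoc appears to point to
idfExplain(Term, Searcher) as the replacement for the deprecated overloads,
which is worth checking before relying on it.

```java
/** Rough model of an idf that also looks at the term text. */
final class TermAwareIdf {
    /** Assumed factor for "$"-marked (exact form) terms. */
    static final float EXACT_FACTOR = 2.0f;

    private TermAwareIdf() {}

    /** Same formula as DefaultSimilarity: ln(numDocs / (docFreq + 1)) + 1. */
    static float defaultIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Scales the default idf when the term carries the exact-form marker. */
    static float idf(String termText, int docFreq, int numDocs) {
        float base = defaultIdf(docFreq, numDocs);
        return termText.endsWith("$") ? EXACT_FACTOR * base : base;
    }
}
```

Here idf("songs$", docFreq, numDocs) is twice idf("song", docFreq, numDocs)
for the same counts, so a match on the exact form contributes more to the
score than a match on the stem alone.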
Thanks,
Itamar.