Re: Use a date field for ranking

Christoph Kiehl Fri, 07 Jan 2005 18:20:13 -0800

Chris Hostetter wrote:

: we are currently implementing a search engine for a news site. Our goal
: is to have a search result that uses the publish date of the documents
: to boost the score of the documents.

: have to use something that boosts the scores at _search_ time.

> 1) There is a way to boost individual Query objects (which you may then
> compose into a tree of BooleanQueries); see Query.setBoost(float)

Yes, I know I can boost Query objects, but that is not the same as multiplying the document score by a factor. Boosting query clauses _adds_ values to the score. Let me show you an example:

I may use queries like this:

Query 1:
(a word that gets a score of 0.1) OR (date:20050108^3 OR date:20050107^1)

Query 2:
(a word that gets a score of 0.01) OR (date:20050108^3 OR date:20050107^1)

Assume the date part of the query contributes a constant score of 0.3. The total scores for the two queries will then be:

Query 1: 0.4
Query 2: 0.31

If I had used a boost of 3.0 per document and left the date part of the query out I would have:

Query 1: 0.3
Query 2: 0.03

This maintains the original proportion of 10:1, whereas the additive version above shrinks it to roughly 1.3:1. Now if I want to specify a function (like 1/x) that calculates the boost factor for a specific publish date, I can't emulate this with Query boosts, because the query boost would have to be scaled to the score of the first part of the query to have the same proportional effect for every query.
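
To make the arithmetic concrete, here is a small Python sketch. It is plain arithmetic, not Lucene code; the function names are made up for illustration, and the numbers are the ones from the example above.

```python
# Illustration only: compares adding a date score (query boost)
# with multiplying by a per-document factor (document boost).

def additive(text_score, date_score):
    """Score when the date clause is OR'ed in and its score is added."""
    return text_score + date_score

def multiplicative(text_score, doc_boost):
    """Score when the whole document score is multiplied by a factor."""
    return text_score * doc_boost

DATE_SCORE = 0.3   # constant contribution of the date clause
DOC_BOOST = 3.0    # per-document factor
text_scores = {"Query 1": 0.1, "Query 2": 0.01}

for name, score in text_scores.items():
    print(name,
          "additive:", round(additive(score, DATE_SCORE), 4),
          "multiplicative:", round(multiplicative(score, DOC_BOOST), 4))

# Ratio between the two queries under each scheme.
additive_ratio = additive(0.1, DATE_SCORE) / additive(0.01, DATE_SCORE)
multiplicative_ratio = multiplicative(0.1, DOC_BOOST) / multiplicative(0.01, DOC_BOOST)
print("additive ratio:", round(additive_ratio, 2))
print("multiplicative ratio:", round(multiplicative_ratio, 2))
```

The multiplicative ratio stays at the original 10:1 no matter what the text scores are, while the additive ratio depends on how large the date score is compared with the text score.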

I'm sure there is a mathematical term which describes exactly this problem - but I'm no mathematician ;) So I hope you understand my issues.

Additionally, the construct above also finds documents that have the right date but don't match the first part of the query. So we might use a query like this instead:

(a word) AND (date:20050108^3 OR date:20050107^1)

But now I have to list _all_ possible dates in the date part to reach every document the index contains. This smells ;) because it is only an emulation of the real strategy.
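
For comparison, the "real strategy" could look roughly like the following Python sketch. It is not Lucene API: it assumes the hits are available after the search as hypothetical (doc_id, score, publish_date) tuples, and recency_boost with its 1/(1 + age in days) shape is only one illustrative choice for the 1/x function.

```python
from datetime import date

def recency_boost(publish_date, today):
    """Hypothetical 1/x-style factor: 1.0 for today, smaller for older documents."""
    age_days = max((today - publish_date).days, 0)
    return 1.0 / (1 + age_days)

def rerank(hits, today):
    """Multiply each hit's score by its recency factor and sort by the result."""
    boosted = [
        (doc_id, score * recency_boost(publish_date, today))
        for doc_id, score, publish_date in hits
    ]
    return sorted(boosted, key=lambda hit: hit[1], reverse=True)

# Made-up hits: (doc_id, text score, publish date).
hits = [
    ("a", 0.10, date(2005, 1, 7)),
    ("b", 0.08, date(2005, 1, 8)),
    ("c", 0.30, date(2004, 12, 25)),
]

ranked = rerank(hits, today=date(2005, 1, 8))
for doc_id, score in ranked:
    print(doc_id, round(score, 4))
```

Because the factor is computed from the publish date itself, every document is covered without listing a single date in the query, and the factor scales the text score instead of being added to it.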

> 2) If you are planning to rebuild your index on a regular basis (i.e.
> nightly), then you can easily apply boosts to your documents when you index
> them.


Unfortunately this is not an option because the index is updated incrementally.

> 3) I'm sure there is a very cool and efficient way to do this using a
> custom Similarity implementation (which somehow causes the default score
> to be divided by the age of the document), but I've never actually played
> with the Similarity class, so I won't say for certain it can be done that
> way (hopefully someone else can chime in)

AFAIK, Similarity can only be used at the term level. But as outlined above, I need a boost factor at the document level.

Thanks for your input,
Christoph

