I meant search this mail archive.... Erick
On 12/9/06, Scott Smith <[EMAIL PROTECTED]> wrote:
I've googled for custom scorers and haven't found anything. If anyone can point me to some posts, that would be appreciated. Sounds like setting the boost to zero works (see Daniel Naber's post), but that seems like overkill. I'll take a look at filters as well. ________________________________ From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Fri 12/8/2006 7:06 PM To: java-user@lucene.apache.org Subject: Re: de-boosting fields I've certainly seen references to writing custom scorers, so it's possible. you might find valuable hints by searching the mail archive. I'll leave it to the more expert folks to suggest which is your best option. Although (and I'm talking beyond my competence here), it *may* work for you to assemble a Filter for the category part of your query and use that instead of including the category in your query. As I understand it, filters don't contribute (or all contribute identically) to the score, leaving the search you're doing on body to determine your relevance, which seems like what you're after. Filters even work with something called a ConstantScoreQuery as I remember, which is a hint <G>. But again, don't be surprised if one of the more expert folks comes up with a *much* better idea <G> Best Erick On 12/8/06, Scott Smith <[EMAIL PROTECTED]> wrote: > > I have a collection of documents for which I've always returned the > results sorted on the date/time of the document (using a sort object in > the search method on my Searcher). It works great. > > > > Suddenly, I have a requirement to return the documents in relevancy > order. So, that's easy (I thought); simply call search() without a sort > object. Unfortunately, the results I got were not what I expected. So, > I added some code to have lucene explain how it was getting the score > and then things became clearer. > > > > Each document has all of the words in the document indexed in a field > called "Body" (vanilla unstored, indexed field). However, there is also > some category information which is kept in a keyword field called > "Category". A document may belong to a large number of categories > (10-70). > > > > When I search, I generate a query which says "give me all of the > documents, in relevancy order, which contain one or more of the > following words: word1, word2, word 3-and it also must be in at least > one of the following categories: category1, category2, ..., categoryN. > > > > What I found was that lucene was using the category information as part > of what it uses to compute the relevancy score (in hindsight, not too > surprising). The problem is that the numbers from the category hits in > "Category" overwhelm the numbers from the word hits in the "Body". So, > my most relevant document may only have a single word hit and a document > way down in the list (in terms of relevancy) might have a number of word > hits. For example, in one search, the top scoring document scored > .2650. Of that, the category information contributed .2635 to that > score-meaning the word hits only contributed .0015 to the relevancy. > This is the opposite of what I want. > > > > I'd be happy to simply eliminate the category information from the score > computation all together (base relevancy scores only on the words which > hit in the "Body" field). Another solution would be to change the boost > on the category information to some small number (zero?) or raise the > Body field boost to a much larger number or both. > > > > What is the best way to do this? Is changing the boost the right > answer? Can a field's boost be zero? Is there a way to write a custom > scorer that gets inserted somewhere? Any suggestions would be > appreciated. > > > > Scott > > > > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]