> you can solve your problem at search time by passing a custom
> Similarity class that looks something like this:
>
>     private Similarity similarity = new DefaultSimilarity() {
>         public float tf(float v) {
>             return 1f;
>         }
>         public float tf(int i) {
>             return 1f;
>         }
>     };

Thanks, but it seems that this solution would make all words completely equal, without regard to their frequency. That is more extreme than what I had in mind. Chris Hostetter's suggestion of SweetSpotSimilarity makes the situation a little better, but it still doesn't distinguish between repeated words and words that appear in different locations in the text.
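To make "reduced effect" concrete, something like the following is closer to what I have in mind: a sublinear tf rather than a flat one. This is only a sketch built on the same DefaultSimilarity override point as the snippet quoted above (Lucene 2.x/3.x packages assumed); the log curve and the cap at three occurrences are arbitrary illustrations, not tuned values.

    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Similarity;

    // Sketch only: dampen repeated terms instead of ignoring frequency entirely.
    public class DampedTfExample {

        private final Similarity dampedSimilarity = new DefaultSimilarity() {
            @Override
            public float tf(float freq) {
                if (freq <= 0f) {
                    return 0f;
                }
                // Grows much more slowly than the default sqrt(freq)...
                float damped = (float) (1.0 + Math.log(freq));
                // ...and stops growing at all past three occurrences (arbitrary cap).
                return Math.min(damped, (float) (1.0 + Math.log(3.0)));
            }
        };

        public void configure(IndexSearcher searcher) {
            // Same hook as in the quoted snippet: hand the Similarity to the searcher.
            searcher.setSimilarity(dampedSimilarity);
        }
    }

If the same shape should also apply at indexing time, IndexWriter has a matching setSimilarity() call, as far as I can tell.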
For example, an encyclopedic article about badgers would probably contain the word "badger" many times throughout its text. I would like such an article to score much higher than an unrelated article that simply uses the word "badger" three times in 800 words. Term frequency works well in this regard, but fails to make the encyclopedic article rank higher than documents that contain the word "badger" and not much else (http://tinyurl.com/8p5jsj).

Paul Libberecht's comment has a point - if I eliminate duplicates in the tokenizer, both at indexing time and in the query parser, I should be able to make search work with a reduced effect for repeated terms (roughly the filter sketched after my signature). However, that approach has two downsides:

1. It will be impossible to find articles with (specifically) "badger badger badger" in them.
2. Sometimes two words are repeated ("barack obama barack obama barack obama"), which makes the tokenizer approach unsuitable.

Another option I'm considering is a negative boost for documents that contain repeated terms, but that is too general, since such a document may be very relevant to searches for different terms. I really only want to change the tf of the offending repeated term.

Thanks for all your suggestions, and I'd appreciate any other ideas.

Israel
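P.S. To be concrete about the tokenizer idea: the duplicate-dropping filter I mean would look roughly like the sketch below. This assumes the attribute-based TokenStream API (Lucene 2.9 or later); the class name is mine and the code is untested. A fresh filter instance per field is assumed, so the "seen" set starts out empty; with reusable token streams it would also have to be cleared in reset().

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    // Keeps only the first occurrence of each term in the stream.
    // As noted above, this breaks phrase queries such as
    // "badger badger badger" and does nothing about repeated phrases.
    public final class DropDuplicateTermsFilter extends TokenFilter {

        private final Set<String> seen = new HashSet<String>();
        private final TermAttribute termAtt;

        public DropDuplicateTermsFilter(TokenStream input) {
            super(input);
            termAtt = (TermAttribute) addAttribute(TermAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            while (input.incrementToken()) {
                if (seen.add(termAtt.term())) {
                    return true;  // first time this term shows up: keep it
                }
                // duplicate term: skip it and try the next token
            }
            return false;
        }
    }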