Hi Israel,

I am trying to put the problem more concisely.

1. Fields where term frequency is very relevant, e.g. Body.
   Example: if the TF of "badger" in the Body of doc 1 > the TF of "badger" in the Body of doc 2, then doc 1 scores higher.
2. Fields where term frequency is irrelevant, e.g. Page_Title.
   Example: the TF of "badger" in Page_Title doesn't affect the score.

If that is the case, then one solution is:
1. Build the query programmatically.
2. Form normal queries on fields of type 1 (e.g. Body).
3. Form constant-score variations of the queries on fields of type 2 (e.g. Page_Title).

There is no need to change anything at index time. I hope that helps.

Thanks,
Umesh

On Sun, Jan 11, 2009 at 8:30 PM, Israel Tsadok <itsa...@gmail.com> wrote:
>
> > you can solve your problem at search time by passing a custom Similarity
> > class that looks something like this:
> >
> >   private Similarity similarity = new DefaultSimilarity() {
> >     public float tf(float v) {
> >       return 1f;
> >     }
> >     public float tf(int i) {
> >       return 1f;
> >     }
> >   };
>
> Thanks, but it seems that this solution would make all words completely
> equal without regard to their frequency. This is more extreme than what I
> had in mind. Chris Hostetter's suggestion of SweetSpotSimilarity makes the
> situation a little better, but still doesn't make the distinction between
> repeated words and words that appear in different locations in the text.
>
> For example, an encyclopedic article about badgers would probably have the
> word "badger" many times throughout its text. I would like to make such an
> article score much higher than an unrelated article that simply used the
> word "badger" three times in 800 words. Term frequency works well in this
> regard, but fails to make the encyclopedic article rank higher than
> documents that simply contain the word "badger" and not much else
> (http://tinyurl.com/8p5jsj).
>
> Paul Libberecht's comment has a point - if I eliminate duplicates in the
> tokenizer both at indexing time and in the query parser, I should be able
> to make search work with a reduced effect for repeated terms. However,
> that approach has two downsides:
> 1. It will be impossible to find articles with (specifically) "badger
> badger badger" in them.
> 2. Sometimes two words are repeated ("barack obama barack obama barack
> obama"), which makes the tokenizer approach unsuitable.
>
> Another option I'm considering is a negative boost to documents that
> contain repeated terms, but this is too general, since such a document may
> be very relevant to searches about different terms. I really only want to
> change the tf of the offending repeated term.
>
> Thanks for all your suggestions, and I'd appreciate any other ideas.
>
> Israel
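P.S. A toy sketch of the per-field idea in plain Java (no Lucene classes involved; the field roles and the 2.0f title weight are made-up numbers for illustration): the Body field contributes a score that grows with term frequency, while Page_Title contributes a fixed score on any match, so repeating a term in the title buys nothing. In real Lucene code you would build a BooleanQuery combining a normal TermQuery on Body with a constant-score query on Page_Title.

```java
// Toy scoring model only -- not Lucene code. The field roles and the
// 2.0f title weight are invented for illustration.
public class ConstantScoreSketch {

    // Body is TF-sensitive; Page_Title contributes a constant score
    // whenever the term appears at all, no matter how often.
    static float score(int bodyTf, int titleTf) {
        float bodyScore = bodyTf;                     // grows with repetition
        float titleScore = (titleTf > 0) ? 2.0f : 0f; // flat, presence-only
        return bodyScore + titleScore;
    }

    public static void main(String[] args) {
        // "badger" 5x in body, 1x in title vs. 3x in body, 10x in title:
        System.out.println(score(5, 1));  // 7.0
        System.out.println(score(3, 10)); // 5.0 -- extra title hits add nothing
    }
}
```

The point is that keyword-stuffing the title field (doc 2) cannot outrank a document whose body genuinely uses the term more, which is exactly the behavior the constant-score variation gives you without touching the index.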