> you can solve your problem at search time by passing a custom
> Similarity class that looks something like this:
>
>     private Similarity similarity = new DefaultSimilarity() {
>         public float tf(float v) {
>             return 1f;
>         }
>         public float tf(int i) {
>             return 1f;
>         }
>     };

Thanks, but it seems that this solution would make all words completely equal, without regard to their frequency. That is more extreme than what I had in mind. Chris Hostetter's suggestion of SweetSpotSimilarity makes the situation a little better, but it still doesn't distinguish between repeated words and words that appear in different locations in the text.
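To make "reduced effect" concrete, something like the following is closer to what I have in mind: a sublinear tf rather than a flat one. This is only a sketch built on the same DefaultSimilarity override point as the snippet quoted above (Lucene 2.x/3.x packages assumed); the log curve and the cap at three occurrences are arbitrary illustrations, not tuned values.

    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Similarity;

    // Sketch only: dampen repeated terms instead of ignoring frequency entirely.
    public class DampedTfExample {

        private final Similarity dampedSimilarity = new DefaultSimilarity() {
            @Override
            public float tf(float freq) {
                if (freq <= 0f) {
                    return 0f;
                }
                // Grows much more slowly than the default sqrt(freq)...
                float damped = (float) (1.0 + Math.log(freq));
                // ...and stops growing at all past three occurrences (arbitrary cap).
                return Math.min(damped, (float) (1.0 + Math.log(3.0)));
            }
        };

        public void configure(IndexSearcher searcher) {
            // Same hook as in the quoted snippet: hand the Similarity to the searcher.
            searcher.setSimilarity(dampedSimilarity);
        }
    }

If the same shape should also apply at indexing time, IndexWriter has a matching setSimilarity() call, as far as I can tell.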
For example, an encyclopedic article about badgers would probably contain the word "badger" many times throughout its text. I would like such an article to score much higher than an unrelated article that simply uses the word "badger" three times in 800 words. Term frequency works well in this regard, but fails to make the encyclopedic article rank higher than documents that contain the word "badger" and not much else (http://tinyurl.com/8p5jsj).

Paul Libberecht's comment has a point - if I eliminate duplicates in the tokenizer, both at indexing time and in the query parser, I should be able to make search work with a reduced effect for repeated terms (roughly the filter sketched after my signature). However, that approach has two downsides:

1. It will be impossible to find articles with (specifically) "badger badger badger" in them.
2. Sometimes two words are repeated ("barack obama barack obama barack obama"), which makes the tokenizer approach unsuitable.

Another option I'm considering is a negative boost for documents that contain repeated terms, but that is too general, since such a document may be very relevant to searches for different terms. I really only want to change the tf of the offending repeated term.

Thanks for all your suggestions, and I'd appreciate any other ideas.

Israel
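P.S. To be concrete about the tokenizer idea: the duplicate-dropping filter I mean would look roughly like the sketch below. This assumes the attribute-based TokenStream API (Lucene 2.9 or later); the class name is mine and the code is untested. A fresh filter instance per field is assumed, so the "seen" set starts out empty; with reusable token streams it would also have to be cleared in reset().

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    // Keeps only the first occurrence of each term in the stream.
    // As noted above, this breaks phrase queries such as
    // "badger badger badger" and does nothing about repeated phrases.
    public final class DropDuplicateTermsFilter extends TokenFilter {

        private final Set<String> seen = new HashSet<String>();
        private final TermAttribute termAtt;

        public DropDuplicateTermsFilter(TokenStream input) {
            super(input);
            termAtt = (TermAttribute) addAttribute(TermAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            while (input.incrementToken()) {
                if (seen.add(termAtt.term())) {
                    return true;  // first time this term shows up: keep it
                }
                // duplicate term: skip it and try the next token
            }
            return false;
        }
    }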