Re: An IDF variation with penalty for very rare terms

Marvin Humphrey Wed, 13 Apr 2011 12:40:49 -0700

On Wed, Apr 13, 2011 at 01:01:09AM +0400, Earwin Burrfoot wrote:
> Excuse me for going somewhat off-topic, but has anybody ever seen/used
> -subj-?
> Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
> Traditional log(N/x) tail, but when nearing zero freq, instead of
> going to +inf you do a nice round bump (with controlled
> height/location/sharpness) and drop down to -inf (or zero).
 
I haven't used that technique, nor can I quote academic literature blessing
it.  Nevertheless, what you're doing makes sense to me.
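
For concreteness, here is a minimal Python sketch of one curve with that
general shape.  It is an assumption on my part, not the formula behind the
linked image: the classic log(N/x) is multiplied by a damping factor that
goes to zero as the document frequency approaches zero.  The names
damped_idf, knee and sharpness are hypothetical; knee sets roughly where the
bump sits and sharpness sets how abruptly the penalty kicks in.

```python
import math


def damped_idf(doc_freq, num_docs, knee=5.0, sharpness=2.0):
    """Hypothetical rare-term-penalized IDF (illustrative only).

    classic = log(N / x) is the traditional tail.
    damping = 1 - exp(-(x / knee) ** sharpness) is close to 1 for common
    terms and falls toward 0 for very rare ones, so the product rises to
    a bump near the knee and then drops to zero instead of going to +inf.
    """
    if doc_freq <= 0:
        return 0.0  # terms that never occur contribute nothing
    classic = math.log(num_docs / doc_freq)
    damping = 1.0 - math.exp(-((doc_freq / knee) ** sharpness))
    return classic * damping


if __name__ == "__main__":
    n = 1_000_000
    # Compare the classic and damped weights across document frequencies.
    for df in (1, 2, 5, 10, 100, 10_000):
        print(df, round(math.log(n / df), 3), round(damped_idf(df, n), 3))
```

With these made-up defaults, terms well above the knee keep essentially
their classic IDF, while near-unique terms are pushed toward zero.  This
variant drops to zero rather than -inf; the right knee and sharpness would
have to be tuned per corpus.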


> Rationale is that - most good, discriminating terms are found in at
> least a certain percentage of your documents, but there are lots of
> mostly unique crapterms, which at some collection sizes stop being
> strictly unique and with IDF's help explode your scores.

So you've designed a heuristic that allows you to filter a certain kind of
noise.  It sounds a lot like how people tune length normalization to adapt to
their document collections.  Many tuning techniques are corpus-specific.
Whatever works, works!

Marvin Humphrey

