Excuse the somewhat off-topic question, but has anybody ever seen or used -subj-?
Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
It has the traditional log(N/x) tail, but as the frequency nears zero,
instead of going to +inf the curve makes a nice round bump (with
controlled height/location/sharpness) and drops down to -inf (or zero).
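
To make the shape concrete, here is a minimal sketch of one way to get such a curve. It is not the formula behind the plot above: it assumes x is the document frequency, and it multiplies log(N/x) by a smooth gate x^k / (x^k + c^k). The names bumped_idf, location (c) and sharpness (k) are made up for illustration. In this variant the curve drops to zero rather than -inf, and the bump height is not a separate knob; it follows from location and sharpness.

```python
import math


def plain_idf(df: int, n_docs: int) -> float:
    """Traditional IDF: log(N / df)."""
    return math.log(n_docs / df)


def bumped_idf(df: int, n_docs: int,
               location: float = 50.0, sharpness: float = 2.0) -> float:
    """log(N / df) scaled by a gate that goes to 0 as df goes to 0.

    Hypothetical parameters (not from the original post):
      location  - document frequency at which the gate equals 0.5
      sharpness - exponent controlling how abruptly the gate closes
    """
    if df <= 0:
        return 0.0  # unseen terms get no weight
    gate = df ** sharpness / (df ** sharpness + location ** sharpness)
    return math.log(n_docs / df) * gate


if __name__ == "__main__":
    n = 1_000_000
    for df in (1, 10, 50, 200, 10_000, 100_000):
        print(df, round(plain_idf(df, n), 3), round(bumped_idf(df, n), 3))
```

For document frequencies well above location the gate is close to 1, so the usual log(N/x) tail is preserved; below it, the weight falls off toward zero instead of growing.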

This should be useful for document comparisons based on the cosine
measure (or something comparable), e.g. in a "more like this" query
(to mention Lucene at least once :) ), over dirty data.
The rationale is that most good, discriminating terms are found in at
least a certain percentage of your documents, but there are lots of
mostly unique crap terms, which at some collection sizes stop being
strictly unique and, with IDF's help, blow up your scores.
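
A toy illustration of that effect, with invented numbers: a 1,000,000-document collection, made-up document frequencies, and a junk token "xq7zz" that occurs in just two documents. The gated_idf weighting is the same kind of hypothetical gate as sketched earlier (log(N/x) times x^k / (x^k + c^k)), not an existing Lucene similarity.

```python
import math

N_DOCS = 1_000_000  # assumed collection size

# Invented document frequencies; "xq7zz" stands in for a near-unique junk token.
DOC_FREQ = {"lucene": 20_000, "search": 50_000, "index": 40_000,
            "weather": 30_000, "recipe": 30_000, "xq7zz": 2}


def plain_idf(df: int) -> float:
    return math.log(N_DOCS / df)


def gated_idf(df: int, location: float = 50.0, sharpness: float = 2.0) -> float:
    # Hypothetical damping: the weight goes to 0 as df goes to 0.
    gate = df ** sharpness / (df ** sharpness + location ** sharpness)
    return math.log(N_DOCS / df) * gate


def cosine(doc_a: set, doc_b: set, weight) -> float:
    """Cosine similarity of two term sets, each term weighted by weight(df)."""
    va = {t: weight(DOC_FREQ[t]) for t in doc_a}
    vb = {t: weight(DOC_FREQ[t]) for t in doc_b}
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (norm_a * norm_b)


query_doc = {"lucene", "search", "xq7zz"}
junk_match = {"xq7zz", "weather", "recipe"}   # shares only the junk token
real_match = {"lucene", "search", "index"}    # shares two meaningful terms

if __name__ == "__main__":
    for name, weight in (("plain", plain_idf), ("gated", gated_idf)):
        print(name,
              round(cosine(query_doc, junk_match, weight), 4),
              round(cosine(query_doc, real_match, weight), 4))
```

With plain IDF the document that shares only the junk token comes out as the closer match; with the gated weight the ordering flips in favor of the document that shares the two meaningful terms.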

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785
