Excuse the somewhat off-topic question, but has anybody ever seen or used -subj-? Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png : the traditional log(N/x) tail, but as the frequency nears zero, instead of going to +inf, the curve makes a nice round bump (with controlled height/location/sharpness) and drops down to -inf (or zero).
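To make the shape concrete, here is a minimal sketch of one function with that behaviour. It is only an illustration, not an existing Lucene API and not necessarily the curve in the picture: the name damped_idf, the parameters cutoff and sharpness, and the choice of damping factor 1 - exp(-(x/cutoff)^sharpness) are all assumptions of mine. The factor goes to zero faster than log(N/x) grows as x nears zero, so the product drops to zero, and it approaches 1 for x well above cutoff, which leaves the usual log(N/x) tail.

```python
import math


def damped_idf(df, num_docs, cutoff=5.0, sharpness=2.0):
    """Hypothetical 'bumped' IDF: classic log(N/df) times a damping factor.

    df        -- document frequency of the term (x in the post)
    num_docs  -- collection size (N in the post)
    cutoff    -- assumed parameter: roughly where the bump sits
    sharpness -- assumed parameter: how abruptly the curve drops near zero
    """
    if df <= 0:
        # Unseen terms contribute nothing instead of +inf.
        return 0.0
    classic = math.log(num_docs / df)
    # 1 - exp(-t), computed via expm1 for accuracy when t is tiny.
    damping = -math.expm1(-((df / cutoff) ** sharpness))
    return classic * damping


if __name__ == "__main__":
    n = 1_000_000
    for df in (1, 2, 5, 10, 20, 100, 1000):
        print(df, round(damped_idf(df, n), 3), round(math.log(n / df), 3))
```

Running it prints the damped value next to the classic one, so you can see where the two curves part ways. This form only pushes the weight down to zero, not to -inf, and the bump height is set indirectly by cutoff and num_docs rather than by a separate knob.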
This should be cool for cosine-measure-based (or comparable) document comparisons (e.g. in a "more like this" query, to mention Lucene at least once :) ) over dirty data. The rationale is that most good, discriminating terms are found in at least a certain percentage of your documents, but there are lots of mostly unique crap terms which, at some collection sizes, stop being strictly unique and, with IDF's help, explode your scores.

--
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785