Indeed, frequency usage is collection- and use-case dependent... Not directly your case, but the idea is the same.
We used this information in a spell/typo-variations context to boost or penalize similarity, by dividing terms into a couple of frequency-based segments (a rough sketch of the idea follows below, after the quote). Take an example:

  Maria  - Very High Freq
  Marina - Very High Freq
  Mraia  - Very Low Freq

similarity(Maria, Marina) is, by string distance measures, very high, practically the same as similarity(Maria, Mraia), but the likelihood that you mistyped Mraia is an order of magnitude higher than if you hit a VHF-VHF pair. Point being, frequency hides a lot of semantics, and how you tune it, as Marvin said, does not really matter, as long as it works. We also never found a theory that formalizes this, but it was logical, and it worked in practice.

What you said makes sense to me, especially for very big collections (or specialized domains with limited vocabulary...): the bigger the collection, the bigger the "garbage density" in the VLF domain (above a certain collection size). If the "vocabulary" in your collection is somehow limited, there is a size limit beyond which most new terms (VLF) are "crapterms". One could try to estimate how "saturated" a collection is...

cheers,
eks

On Wed, Apr 13, 2011 at 9:36 PM, Marvin Humphrey <mar...@rectangular.com> wrote:
> On Wed, Apr 13, 2011 at 01:01:09AM +0400, Earwin Burrfoot wrote:
>> Excuse me for somewhat of an off-topic question, but has anybody ever
>> seen/used -subj- ?
>> Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
>> Traditional log(N/x) tail, but when nearing zero freq, instead of
>> going to +inf you do a nice round bump (with controlled
>> height/location/sharpness) and drop down to -inf (or zero).
>
> I haven't used that technique, nor can I quote academic literature blessing
> it. Nevertheless, what you're doing makes sense to me.
>
>> Rationale is that most good, discriminating terms are found in at
>> least a certain percentage of your documents, but there are lots of
>> mostly unique crapterms, which at some collection sizes stop being
>> strictly unique and, with IDF's help, explode your scores.
>
> So you've designed a heuristic that allows you to filter a certain kind of
> noise. It sounds a lot like how people tune length normalization to adapt to
> their document collections. Many tuning techniques are corpus-specific.
> Whatever works, works!
>
> Marvin Humphrey
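PS: to make the frequency-band idea above concrete, here is a minimal, hypothetical sketch in Java. The band cutoffs and the boost/penalty factors are invented for illustration; none of this is from our actual code or from any Lucene API.

    // A minimal sketch of banding terms by document frequency and using
    // the bands to boost/penalize a raw string-distance similarity.
    // All thresholds and factors below are assumptions for illustration.
    enum FreqBand { VLF, LF, HF, VHF }

    final class BandedTypoSimilarity {
        private final long totalDocs;

        BandedTypoSimilarity(long totalDocs) {
            this.totalDocs = totalDocs;
        }

        // Assign a frequency band from document frequency (cutoffs assumed).
        FreqBand band(long docFreq) {
            double ratio = (double) docFreq / totalDocs;
            if (ratio > 1e-2) return FreqBand.VHF;
            if (ratio > 1e-3) return FreqBand.HF;
            if (ratio > 1e-5) return FreqBand.LF;
            return FreqBand.VLF;
        }

        // Adjust a raw similarity (e.g. normalized Levenshtein, in [0,1])
        // by how plausible a typo relation between the two bands is.
        double adjust(double rawSim, long observedDf, long candidateDf) {
            FreqBand observed = band(observedDf);
            FreqBand candidate = band(candidateDf);
            if (observed == FreqBand.VLF && candidate == FreqBand.VHF) {
                // "Mraia" vs "Maria": a rare term one edit away from a very
                // common one is probably a typo of it -> boost (capped at 1).
                return Math.min(1.0, rawSim * 1.5);
            }
            if (observed == FreqBand.VHF && candidate == FreqBand.VHF) {
                // "Maria" vs "Marina": both very common, so both are probably
                // legitimate names -> penalize the typo hypothesis.
                return rawSim * 0.5;
            }
            return rawSim; // other band pairs: leave the raw similarity alone
        }
    }

The point of the structure is only that the VLF-VHF and VHF-VHF cases get different treatment for the same raw string distance; how many bands you use and where you cut them is, as said above, pure tuning.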
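PPS: for the curve Earwin describes, one possible way to get a traditional log(N/x) tail that bends into a rounded bump and drops toward zero at very low frequency is to damp the IDF with a rational function. This is just a guess at a function with the described shape, not the formula behind the linked plot; the parameter names are mine.

    // Traditional IDF damped so it falls to zero (the "or zero" variant)
    // instead of exploding as docFreq -> 0. bumpLocation is the doc freq
    // where damping reaches 0.5 (the curve peaks somewhat above it);
    // sharpness controls how abruptly the damping kicks in.
    final class BumpedIdf {
        static double idf(double docFreq, double numDocs,
                          double bumpLocation, double sharpness) {
            double tail = Math.log(numDocs / docFreq);      // traditional log(N/x)
            double t = Math.pow(docFreq / bumpLocation, sharpness);
            double damping = t / (1.0 + t);                 // ~1 well above
                                                            // bumpLocation, ~0 near 0
            return tail * damping;
        }

        public static void main(String[] args) {
            // For N = 1M docs, bump around df = 20, sharpness 2: near zero
            // at df=1, rises through a rounded maximum, then follows the
            // ordinary log(N/x) tail down for common terms.
            for (long df : new long[] {1, 5, 20, 100, 10_000}) {
                System.out.printf("df=%d idf=%.2f%n",
                                  df, idf(df, 1_000_000, 20, 2));
            }
        }
    }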