Indeed, frequency usage is collection- and use-case-dependent...
Not directly your case, but the idea is the same.
We used this information in a spell/typo-variations context to
boost/penalize similarity, by dividing terms into a couple of
frequency-based segments.
Take an example:
Maria - Very High Freq
Marina - Very High Freq
Mraia - Very Low Freq
similarity(Maria, Marina) is very high by string-distance measures,
practically the same as similarity(Maria, Mraia), but the likelihood
that Mraia is a mistyping is an order of magnitude higher than for the
VHF-VHF pair.
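A minimal sketch of that kind of segment-based adjustment might look
like this (the segment cut-offs and the boost/penalty factors below
are hypothetical, not the values we actually used):

    // Segment terms by document frequency, then adjust a raw
    // string-similarity score for a candidate pair.
    enum FreqSegment { VERY_LOW, LOW, MEDIUM, HIGH, VERY_HIGH }

    final class FreqAwareSimilarity {

        // Hypothetical cut-offs on document frequency relative to
        // collection size; tune per collection.
        static FreqSegment segmentOf(long docFreq, long numDocs) {
            double ratio = (double) docFreq / numDocs;
            if (ratio > 0.01)   return FreqSegment.VERY_HIGH;
            if (ratio > 0.001)  return FreqSegment.HIGH;
            if (ratio > 0.0001) return FreqSegment.MEDIUM;
            if (docFreq > 5)    return FreqSegment.LOW;
            return FreqSegment.VERY_LOW;
        }

        // Boost a VHF-VLF pair: the rare spelling (Mraia) is probably a
        // typo of the frequent one (Maria). Penalize a VHF-VHF pair:
        // Maria and Marina are both established terms, so a small edit
        // distance is likely coincidence, not a typo.
        static double adjust(double rawSimilarity, FreqSegment a, FreqSegment b) {
            boolean bothVeryHigh = a == FreqSegment.VERY_HIGH && b == FreqSegment.VERY_HIGH;
            boolean highLowPair = (a == FreqSegment.VERY_HIGH && b == FreqSegment.VERY_LOW)
                               || (a == FreqSegment.VERY_LOW && b == FreqSegment.VERY_HIGH);
            if (highLowPair)  return Math.min(1.0, rawSimilarity * 1.2);
            if (bothVeryHigh) return rawSimilarity * 0.8;
            return rawSimilarity;
        }
    }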
Point being, frequency hides a lot of semantics, and how you tune it,
as Marvin said, does not really matter, as long as it works.
We never found theory that formalizes this either, but it was logical,
and it worked in practice.
What you said makes sense to me, especially for very big collections
(or specialized domains with a limited vocabulary...): the bigger the
collection, the higher the garbage density in the VLF domain (above a
certain collection size). If the vocabulary in your collection is
somehow limited, there is a size threshold past which most new (VLF)
terms are crapterms. One could try to estimate how saturated a
collection is...
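For example, one rough way to watch saturation (just a sketch; the
windowing mechanics and all names here are hypothetical) is to track
what fraction of incoming tokens introduce a never-before-seen term:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // In a saturated collection with a limited vocabulary, the fraction
    // of tokens introducing a previously unseen term drops toward the
    // noise (typo) floor, so fresh VLF terms are increasingly likely to
    // be crapterms.
    final class SaturationEstimator {
        private final Set<String> vocabulary = new HashSet<>();
        private long windowTokens = 0;
        private long windowNewTerms = 0;

        // Reset the measurement window, keeping the accumulated vocabulary.
        void startWindow() {
            windowTokens = 0;
            windowNewTerms = 0;
        }

        void addDocument(List<String> tokens) {
            for (String token : tokens) {
                windowTokens++;
                if (vocabulary.add(token)) {
                    windowNewTerms++;
                }
            }
        }

        // Fraction of tokens in the current window never seen before.
        // Compare windows over time: a flat, low rate suggests saturation.
        double newTermRate() {
            return windowTokens == 0 ? 0.0 : (double) windowNewTerms / windowTokens;
        }
    }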
cheers,
eks
On Wed, Apr 13, 2011 at 9:36 PM, Marvin Humphrey mar...@rectangular.com wrote:
On Wed, Apr 13, 2011 at 01:01:09AM +0400, Earwin Burrfoot wrote:
Excuse me for going somewhat off-topic, but has anybody ever seen/used
-subj-?
Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
Traditional log(N/x) tail, but when nearing zero freq, instead of
going to +inf you do a nice round bump (with controlled
height/location/sharpness) and drop down to -inf (or zero).
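For concreteness, one way to get that shape (assuming a logistic gate
in log-df space; the linked plot may use a different function, and the
parameter names are guesses):

    // The usual log(N/df) tail, but with a smooth gate that turns the
    // spike at very low df into a rounded bump and then pulls the score
    // down to zero. "bumpLocation" and "sharpness" stand in for the
    // controlled location/sharpness knobs; the bump height falls out of
    // where the gate crosses the rising log tail.
    final class DampenedIdf {
        private final long numDocs;        // N
        private final double bumpLocation; // df at which the gate is at 0.5
        private final double sharpness;    // larger => steeper drop below the bump

        DampenedIdf(long numDocs, double bumpLocation, double sharpness) {
            this.numDocs = numDocs;
            this.bumpLocation = bumpLocation;
            this.sharpness = sharpness;
        }

        double idf(long docFreq) {
            double classic = Math.log((double) numDocs / docFreq);
            // Logistic gate: ~1 for df well above the bump, ~0 for df well
            // below it, so near-unique terms score near zero instead of
            // exploding.
            double gate = 1.0 / (1.0 + Math.exp(
                    -sharpness * (Math.log(docFreq) - Math.log(bumpLocation))));
            return classic * gate;
        }
    }

This is the drop-to-zero variant; with bumpLocation around a handful of
documents and moderate sharpness, df=1 terms score near zero while
moderately rare terms keep a rounded peak.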
I haven't used that technique, nor can I quote academic literature blessing
it. Nevertheless, what you're doing makes sense to me.
The rationale is that most good, discriminating terms are found in at
least a certain percentage of your documents, but there are lots of
mostly-unique crapterms, which at some collection sizes stop being
strictly unique and, with IDF's help, explode your scores.
So you've designed a heuristic that allows you to filter a certain kind of
noise. It sounds a lot like how people tune length normalization to adapt to
their document collections. Many tuning techniques are corpus-specific.
Whatever works, works!
Marvin Humphrey