Re: An IDF variation with penalty for very rare terms

2011-04-15 Thread eks dev
Indeed, frequency usage is collection- and use-case-dependent...
Not directly your case, but the idea is the same.

We used this information in a spelling/typo-variation context to
boost or penalize similarity, by dividing terms into a couple of
frequency-based segments.

Take an example:
Maria - Very High Freq
Marina - Very High Freq
Mraia - Very Low Freq

By string distance measures, similarity(Maria, Marina) is very high,
practically the same as similarity(Maria, Mraia), but the likelihood
that Mraia is a mistyped Maria is an order of magnitude higher than
the likelihood of a typo within a VHF-VHF pair.
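
A minimal sketch of that kind of adjustment, not our actual code. The
document frequencies, the segment cut-offs and the pair multipliers
below are all made-up values for illustration, and the optimal string
alignment distance is just one possible string measure:

```python
# Hypothetical document frequencies; the thread only labels terms VHF or VLF.
DOC_FREQ = {"Maria": 50_000, "Marina": 40_000, "Mraia": 3}


def osa_distance(a, b):
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent swap, e.g. "ar" <-> "ra"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def string_similarity(a, b):
    """Map the distance to a 0..1 score by dividing by the longer length."""
    return 1.0 - osa_distance(a, b) / max(len(a), len(b), 1)


def segment(term, high=10_000, low=10):
    """Bucket a term by document frequency (cut-offs are assumptions)."""
    df = DOC_FREQ.get(term, 0)
    if df >= high:
        return "VHF"
    if df <= low:
        return "VLF"
    return "MID"


# Assumed multipliers per pair of segments; unlisted pairs are left unchanged.
PAIR_WEIGHT = {
    frozenset({"VHF"}): 0.5,         # two common terms: probably distinct words
    frozenset({"VHF", "VLF"}): 1.2,  # rare next to common: probably a typo
}


def adjusted_similarity(a, b):
    pair = frozenset({segment(a), segment(b)})
    return string_similarity(a, b) * PAIR_WEIGHT.get(pair, 1.0)
```

Both pairs are one edit apart under this distance, so the string score
alone barely separates them; the segment multiplier is what pushes
(Maria, Mraia) above (Maria, Marina).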

Point being, frequency hides a lot of semantics, and how you tune it,
as Martin said, does not really matter, if it works.

We also never found a theory that formalizes this, but it was logical,
and it worked in practice.

What you said makes sense to me, especially for very big collections
(or specialized domains with limited vocabulary...): above a certain
collection size, the bigger the collection, the higher the garbage
density in the VLF domain. If the vocabulary in your collection is
somehow limited, there is a size limit beyond which most new terms
(VLF) are crapterms. One could try to estimate how saturated a
collection is...
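
One possible reading of "saturation", sketched here as an assumption
rather than something we measured: feed the documents in batches and
track what share of each batch's distinct terms has never been seen
before. The function name, the whitespace tokenizer and the batch size
are all hypothetical.

```python
def new_term_rate(docs, batch_size):
    """For each batch of documents, return the share of its distinct terms
    that did not occur in any earlier batch."""
    seen = set()
    rates = []
    for start in range(0, len(docs), batch_size):
        batch_terms = set()
        for doc in docs[start:start + batch_size]:
            # Naive whitespace tokenizer, an assumption for this sketch
            batch_terms.update(doc.lower().split())
        new_terms = batch_terms - seen
        rates.append(len(new_terms) / len(batch_terms) if batch_terms else 0.0)
        seen |= batch_terms
    return rates


docs = ["maria marina", "maria anna", "marina anna", "maria mraia"]
rates = new_term_rate(docs, batch_size=2)
```

Once the rate flattens out at a low value, the vocabulary is close to
saturated, and whatever still shows up as new is a candidate for the
garbage bucket.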


cheers,
eks


On Wed, Apr 13, 2011 at 9:36 PM, Marvin Humphrey mar...@rectangular.com wrote:
 On Wed, Apr 13, 2011 at 01:01:09AM +0400, Earwin Burrfoot wrote:
 Excuse me for going somewhat off-topic, but has anybody ever seen/used
 -subj-?
 Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
 Traditional log(N/x) tail, but when nearing zero freq, instead of
 going to +inf you do a nice round bump (with controlled
 height/location/sharpness) and drop down to -inf (or zero).

 I haven't used that technique, nor can I quote academic literature blessing
 it.  Nevertheless, what you're doing makes sense to me.

 Rationale is that - most good, discriminating terms are found in at
 least a certain percentage of your documents, but there are lots of
 mostly unique crapterms, which at some collection sizes stop being
 strictly unique and with IDF's help explode your scores.

 So you've designed a heuristic that allows you to filter a certain kind of
 noise.  It sounds a lot like how people tune length normalization to adapt to
 their document collections.  Many tuning techniques are corpus-specific.
 Whatever works, works!

 Marvin Humphrey


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






Re: An IDF variation with penalty for very rare terms

2011-04-13 Thread Marvin Humphrey
On Wed, Apr 13, 2011 at 01:01:09AM +0400, Earwin Burrfoot wrote:
 Excuse me for going somewhat off-topic, but has anybody ever seen/used -subj-?
 Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
 Traditional log(N/x) tail, but when nearing zero freq, instead of
 going to +inf you do a nice round bump (with controlled
 height/location/sharpness) and drop down to -inf (or zero).
 
I haven't used that technique, nor can I quote academic literature blessing
it.  Nevertheless, what you're doing makes sense to me.

 Rationale is that - most good, discriminating terms are found in at
 least a certain percentage of your documents, but there are lots of
 mostly unique crapterms, which at some collection sizes stop being
 strictly unique and with IDF's help explode your scores.

So you've designed a heuristic that allows you to filter a certain kind of
noise.  It sounds a lot like how people tune length normalization to adapt to
their document collections.  Many tuning techniques are corpus-specific.
Whatever works, works!

Marvin Humphrey





An IDF variation with penalty for very rare terms

2011-04-12 Thread Earwin Burrfoot
Excuse me for going somewhat off-topic, but has anybody ever seen/used -subj-?
Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
Traditional log(N/x) tail, but when nearing zero freq, instead of
going to +inf you do a nice round bump (with controlled
height/location/sharpness) and drop down to -inf (or zero).
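
A rough sketch of one curve with that shape, not necessarily the one in
the picture: the usual log(N/x) multiplied by a smooth gate that closes
as x falls below a chosen frequency. The function name and the
`location` and `sharpness` parameters are made up for illustration, and
in this form the bump height is not an independent knob; it follows
from the other two. This variant drops to zero rather than -inf.

```python
import math


def penalized_idf(df, num_docs, location=5.0, sharpness=2.0):
    """log(N/df) damped by a gate that goes to 0 as df drops below `location`.

    For df much larger than `location` the gate is ~1 and the value matches
    plain log(N/df); for df much smaller it shrinks like (df/location)**sharpness.
    """
    if df <= 0:
        return 0.0
    gate = 1.0 / (1.0 + (location / df) ** sharpness)
    return math.log(num_docs / df) * gate


N = 1_000_000
curve = [(df, penalized_idf(df, N)) for df in (1, 5, 20, 1_000, 100_000)]
```

With these defaults the weight rises from df=1 to a peak somewhere
above `location`, then follows the traditional tail downwards.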

Should be cool when doing cosine-measure-based (or something
comparable) document comparisons over dirty data (e.g. in a
more-like-this query, to mention Lucene at least once :) ).
The rationale is that most good, discriminating terms are found in at
least a certain percentage of your documents, but there are lots of
mostly unique crapterms, which at some collection sizes stop being
strictly unique and, with IDF's help, explode your scores.
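
To illustrate the effect on a toy case (all document frequencies and
terms below are invented, term frequency is treated as binary, and the
gated weight is only an assumed stand-in for the curve in the picture):
two documents that share nothing but one junk term.

```python
import math


def plain_idf(df, n):
    return math.log(n / df)


def gated_idf(df, n, location=5.0, sharpness=2.0):
    # Assumed penalty: log(N/df) scaled by a gate that closes for very low df
    return math.log(n / df) / (1.0 + (location / df) ** sharpness)


def cosine(doc_a, doc_b, df, n, idf):
    """Cosine of two term sets, each term weighted by the given idf function."""
    w = {t: idf(df[t], n) for t in doc_a | doc_b}
    dot = sum(w[t] ** 2 for t in doc_a & doc_b)
    norm_a = math.sqrt(sum(w[t] ** 2 for t in doc_a))
    norm_b = math.sqrt(sum(w[t] ** 2 for t in doc_b))
    return dot / (norm_a * norm_b)


N = 1_000_000
# Hypothetical document frequencies; "xq7zz" stands in for a crapterm
DF = {"lucene": 2_000, "scoring": 5_000, "index": 8_000, "parser": 4_000, "xq7zz": 2}
doc_a = {"lucene", "scoring", "xq7zz"}
doc_b = {"index", "parser", "xq7zz"}

sim_plain = cosine(doc_a, doc_b, DF, N, plain_idf)
sim_gated = cosine(doc_a, doc_b, DF, N, gated_idf)
```

Under plain IDF the shared junk term dominates both vectors and the two
documents look very similar; with the gated weight it contributes
little and the similarity falls close to zero.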

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785
