oops, forgot the log 

So...
idf-weighted preference value = item preference value * log(number of all 
users / number of users with a preference for that item)

                items
        1   0   0
users   1   0   0
        1   1   0

freq          3    1    0
#users/freq  3/3  3/1  n/a  (no users have the third item)

So the idf-weighted values are

       1*log(1)   0          0
       1*log(1)   0          0
       1*log(1)   1*log(3)   0
sum    0          log(3)     0

so the IDF-weighted matrix is

                items
        0   0     0
users   0   0     0
        0   0.48  0

This results in no information for universally preferred items, which is indeed 
what I was looking for. It looks like this should also work for other explicit 
preference values--item prices, ratings, etc.
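The computation above can be sketched in a few lines (a sketch using NumPy; the 3x3 boolean matrix is the one from the example, and base-10 log matches the 0.48 value):

```python
import numpy as np

# Boolean preference matrix from the example above: 3 users x 3 items
prefs = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
], dtype=float)

n_users = prefs.shape[0]
freq = prefs.sum(axis=0)  # number of users with each item: [3, 1, 0]

# IDF per item: log10(n_users / freq); an item nobody prefers gets weight 0
idf = np.where(freq > 0, np.log10(n_users / np.maximum(freq, 1.0)), 0.0)

weighted = prefs * idf
# The universal item's column is all zeros; the middle item gets log10(3) = 0.48
```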

Intuition says this will lower a precision-related cross-validation measure, 
since you are discounting the obvious recommendations. I have no experience 
with measuring something like this; any experience you have would be 
appreciated.
  
On Feb 5, 2013, at 12:33 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

On Tue, Feb 5, 2013 at 11:29 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> I think you meant: "Human relatedness decays much slower than item
> popularity."
> 

Yes.  Oops.


> So to make sure I understand the implications of using IDF…  For
> boolean/implicit preferences the sum of all prefs (after weighting) for a
> single item over all users will always be 1 or 0. This no matter whether
> the frequency is 1M or 1.
> 

I don't see this.

For things that occur once for N users, the sum is log N.  For items that
occur for every user, the sum will be 0.
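A quick numeric check of those two sums (a sketch; n = 1000 is an arbitrary user count, and the log base only rescales the result):

```python
import math

n = 1000  # hypothetical number of users

# Item preferred by exactly one of n users: one preference weighted log(n/1)
rare_sum = 1 * math.log(n / 1)

# Item preferred by all n users: n preferences, each weighted log(n/n) = 0
ubiquitous_sum = n * math.log(n / n)

print(rare_sum)        # log(1000), about 6.91
print(ubiquitous_sum)  # 0.0
```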

> Another approach would be to do some kind of outlier detection and remove
> those users.


Down-sampling and proper thresholding handle this.  Crazy users and
crawlers are relatively rare, and each gets only a single vote.  This makes
them immaterial.

> Looking at some types of web data you will see crawlers as outliers mucking
> up impression or click-thru data.
> 

You will see them, but they shouldn't matter.


> 
> On Feb 2, 2013, at 1:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
> On Sat, Feb 2, 2013 at 1:03 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> 
>> Indeed, please elaborate. Not sure what you mean by "this is an important
>> effect"
>> 
>> Do you disagree with what I said re temporal decay?
>> 
> 
> No.  I agree with it.  Human relatedness decays much more quickly than item
> popularity.
> 
> I was extending this.  Down-sampling should make use of this observation to
> try to preserve time coincidence in the resulting dataset.
> 
> 
>> As to downsampling or rather reweighting outliers in popular items and/or
>> active users--It's another interesting question. Does the fact that we
> both
>> like puppies and motherhood make us in any real way similar? I'm quite
>> interested in ways to account for this. I've seen what is done to
> normalize
>> ratings from different users based on whether they tend to rate high or
>> low. I'm interested in any papers talking about the super active user or
>> super popular items.
>> 
> 
> I view downsampling as a necessary evil when using cooccurrence based
> algorithms.  This only applies to prolific users.
> 
> For items, I tend to use simple IDF weightings.  This gives very low
> weights to ubiquitous preferences.
> 
> 
> 
>> 
>> Another subject of interest is the question; is it possible to create a
>> blend of recommenders based on their performance on long tail items.
> 
> 
> Absolutely this is possible and it is a great thing to do.  Ensembles are
> all the fashion rage and for good reason.  See all the top players in the
> Netflix challenge.
> 
> 
>> For instance if the precision of a recommender (just considering the
>> item-item similarity for the present) as a function of item popularity
>> decreases towards the long tail, is it possible that one type of
>> recommender does better than another--do the distributions cross? This
>> would suggest a blending strategy based on how far out the long tail you
>> are when calculating similar items.
> 
> 
> Yeah... but you can't tell very well due to the low counts.
> 
> 
