Oops, forgot the log. So... IDF-weighted preference value = item preference value * log(number of all users / number of users with a preference for that item).
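Here is a rough numpy sketch of that weighting, just to make the arithmetic concrete. The function name (idf_weight) is made up, logs are base 10 to match the worked example below, and items nobody prefers simply get weight 0:

import numpy as np

def idf_weight(prefs):
    # prefs: users x items matrix of boolean (0/1) preferences.
    num_users = prefs.shape[0]
    # Number of users expressing a preference for each item.
    freq = np.count_nonzero(prefs, axis=0)
    # IDF = log10(num_users / freq); items with freq == 0 keep weight 0
    # rather than dividing by zero.
    idf = np.zeros(prefs.shape[1])
    idf[freq > 0] = np.log10(num_users / freq[freq > 0])
    return prefs * idf

prefs = np.array([[1, 0, 0],
                  [1, 0, 0],
                  [1, 1, 0]])
print(idf_weight(prefs))
# The ubiquitous first item weights to 0 for everyone, the second item gets
# log10(3) ~= 0.48 for the one user who prefers it, and the never-preferred
# third item stays 0 -- matching the hand-worked matrices below.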
For the example matrix:

             items
             1    0    0
users        1    0    0
             1    1    0

freq         3    1    0
#users/freq  3/3  3/1  0

So the IDF-weighted values are

          1*log(1)  0         0
          1*log(1)  0         0
          1*log(1)  1*log(3)  0

sum       0         log(3)    0

so the IDF-weighted matrix is

        items
        0    0     0
users   0    0     0
        0    0.48  0

This results in no information for universally preferred items, which is indeed what I was looking for. It looks like this should also work for other values or explicit preferences--item prices, ratings, etc. Intuition says this will result in a lower precision-related cross-validation measure, since you are discounting the obvious recommendations. I have no experience with measuring something like this; any you have would be appreciated.

On Feb 5, 2013, at 12:33 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

On Tue, Feb 5, 2013 at 11:29 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> I think you meant: "Human relatedness decays much slower than item
> popularity."
>

Yes. Oops.

> So to make sure I understand the implications of using IDF… For
> boolean/implicit preferences the sum of all prefs (after weighting) for a
> single item over all users will always be 1 or 0. This no matter whether
> the frequency is 1M or 1.
>

I don't see this. For things that occur once for N users, the sum is log N.
For items that occur for every user, the sum will be 0.

> Another approach would be to do some kind of outlier detection and remove
> those users.

Down-sampling and proper thresholding handles this. Crazy users and
crawlers are relatively rare and each get only a single vote. This makes
them immaterial.

> Looking at some types of web data you will see crawlers as outliers mucking
> up impression or click-thru data.
>

You will see them, but they shouldn't matter.

> On Feb 2, 2013, at 1:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> On Sat, Feb 2, 2013 at 1:03 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>
>> Indeed, please elaborate. Not sure what you mean by "this is an important
>> effect"
>>
>> Do you disagree with what I said re temporal decay?
>>
>
> No. I agree with it. Human relatedness decays much more quickly than item
> popularity.
>
> I was extending this. Down-sampling should make use of this observation to
> try to preserve time coincidence in the resulting dataset.
>
>> As to downsampling or rather reweighting outliers in popular items and/or
>> active users--It's another interesting question. Does the fact that we
>> both like puppies and motherhood make us in any real way similar? I'm
>> quite interested in ways to account for this. I've seen what is done to
>> normalize ratings from different users based on whether they tend to rate
>> high or low. I'm interested in any papers talking about the super active
>> user or super popular items.
>>
>
> I view downsampling as a necessary evil when using cooccurrence based
> algorithms. This only applies to prolific users.
>
> For items, I tend to use simple IDF weightings. This gives very low
> weights to ubiquitous preferences.
>
>> Another subject of interest is the question: is it possible to create a
>> blend of recommenders based on their performance on long tail items?
>
> Absolutely this is possible and it is a great thing to do. Ensembles are
> all the fashion rage and for good reason. See all the top players in the
> Netflix challenge.
>
>> For instance if the precision of a recommender (just considering the
>> item-item similarity for the present) as a function of item popularity
>> decreases towards the long tail, is it possible that one type of
>> recommender does better than another--do the distributions cross? This
>> would suggest a blending strategy based on how far out the long tail you
>> are when calculating similar items.
>
> Yeah... but you can't tell very well due to the low counts.
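On the long-tail question quoted above (whether the precision curves of two recommenders cross as you move toward unpopular items), one rough way to look at it is to bucket the held-out items by training-set popularity and compute a precision-like hit rate per bucket for each recommender. This is only a sketch with made-up names and toy data; it assumes one held-out (user, item) pair per trial and a 0/1 flag for whether the recommender put that item in its top k:

import numpy as np

def hit_rate_by_popularity(popularity, hits, num_buckets=5):
    # popularity[i]: training-set frequency of the held-out item in trial i.
    # hits[i]: 1 if the recommender ranked that item in its top k, else 0.
    popularity = np.asarray(popularity)
    hits = np.asarray(hits, dtype=float)
    # Rank-based buckets so each bucket holds roughly the same number of
    # trials (bucket 0 = least popular items, i.e. the long tail).
    ranks = popularity.argsort().argsort()
    buckets = ranks * num_buckets // len(ranks)
    return np.array([hits[buckets == b].mean() for b in range(num_buckets)])

# Made-up results for two recommenders evaluated on the same held-out pairs:
popularity = np.array([1, 2, 3, 5, 8, 20, 50, 200, 500, 1000])
hits_a = np.array([0, 0, 1, 0, 1, 1, 1, 1, 1, 1])  # stronger on popular items
hits_b = np.array([1, 1, 1, 0, 1, 0, 1, 0, 1, 1])  # stronger in the tail
print(hit_rate_by_popularity(popularity, hits_a))  # tail bucket first: [0.  0.5 1.  1.  1. ]
print(hit_rate_by_popularity(popularity, hits_b))  # [1.  0.5 0.5 0.5 1. ]

Ted's point about low counts shows up here as very noisy estimates in the tail buckets, so you would need a lot of held-out pairs before trusting an apparent crossover.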