> Yes you can easily compute these offline, and then use them in an
> alternate implementation of LogLikelihoodSimilarity to produce fast
> results. The only possible issue I see is loading all these values
> into memory, since it grows as the square of the number of items.
>
> You could store only 'most-popular' pairs in memory and fall back to a
> normal computation when the pair hasn't been precomputed. This amounts
> to caching, and might be done as easily by simply writing a caching
> wrapper around LogLikelihoodSimilarity, which caches with the "Cache"
> class, which will do a nice job of limiting the memory usage and
> removing unpopular entries.
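For concreteness, here is a rough sketch of that caching-wrapper idea. The
PairwiseSimilarity interface and the LinkedHashMap-based LRU cache below are
illustrative stand-ins, not the actual Mahout Cache or ItemSimilarity classes:

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a pairwise item-item similarity (in Mahout this
// role is played by an ItemSimilarity such as LogLikelihoodSimilarity).
interface PairwiseSimilarity {
  double similarity(long itemID1, long itemID2);
}

// Wraps any similarity in a size-bounded, least-recently-used cache:
// popular pairs are answered from memory, everything else falls back to
// the normal computation.
final class CachingSimilarity implements PairwiseSimilarity {

  private final PairwiseSimilarity delegate;
  private final Map<Long, Double> cache;

  CachingSimilarity(PairwiseSimilarity delegate, final int maxEntries) {
    this.delegate = delegate;
    // access-ordered LinkedHashMap evicts the least-recently-used entry,
    // which bounds memory and drops unpopular pairs first
    this.cache = new LinkedHashMap<Long, Double>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Long, Double> eldest) {
        return size() > maxEntries;
      }
    };
  }

  @Override
  public double similarity(long itemID1, long itemID2) {
    // order-independent key; assumes item IDs fit in 32 bits
    long lo = Math.min(itemID1, itemID2);
    long hi = Math.max(itemID1, itemID2);
    Long key = (lo << 32) | (hi & 0xFFFFFFFFL);
    Double cached = cache.get(key);
    if (cached != null) {
      return cached;
    }
    double value = delegate.similarity(itemID1, itemID2);
    cache.put(key, value);
    return value;
  }
}

Wrapping the real log-likelihood computation this way keeps memory use bounded
no matter how many item pairs exist, at the cost of recomputing unpopular pairs.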
That sounds interesting.

> > There is also another measure that we talked about in our discussion. Ted
> > said dot product can yield very good results and it can be weighted with
> > inverse user frequency.
> > I've implemented 3 versions of that approach to see how good the results
> > are:
>
> Agree, this is what PearsonCorrelationSimilarity basically does --
> really it's also an implementation of the cosine-measure similarity,
> since the data is 'centered' to a mean of zero. And I believe this is
> what Ted is referring to by dot product. The framework also already
> has a transformation for inverse user frequency.
>
> But yeah I think your point is doing this at scale, all at once, with
> a big matrix operation, via mapreduce, for all pairs of items. I agree
> this is a useful thing. The results, when precomputed this way, make
> the on-line similarity computation very fast. You could feed it in to
> any item-based recommender that way.
>
> Or you could rewrite the rest of the recommender process in mapreduce,
> continue that way.
>
> > First, I have worked on a 1-month user log of an e-commerce site. The
> > data set is very sparse, but there are some items that are very popular
> > and don't fit the general characteristics of the data.
> >
> > I've tried 3 versions of that computation:
> > 1- iuf = log(N/(1+uf)); N is the total number of users here
>
> Yep
>
> > score(item1,item2) = cosine similarity of the item1 and item2 vectors
> > whose elements are weighted with the iuf described above; the dot product
> > of the two items is normalized by the Euclidean lengths of the items
> > (i.e. score = dotproduct/(length(item1)*length(item2)))
>
> PS yeah this is the cosine measure similarity, and also reduces to
> Pearson when the data has mean zero.
>
> > 2- iuf is the same as in 1, score is just the dot product
>
> I think it's problematic to not normalize the dot product. I wouldn't
> expect this to work well, but let's see your results below --
>
> > 3- iuf = N/(1+uf), and score is just the dot product
>
> Valid, but I do think the log() is useful to keep here.
>
> > As a result, I think it would be good if we found a way to smooth the
> > item vectors to make the 1st way yield better results, for example by
> > filling in some missing data. And it would be good if we had a way to
> > merge the 2nd and 3rd ways. I mean, another similarity measure that finds
> > hidden, hard-to-find similar items like 2, but also gives results like
> > the 3rd way.
>
> "Filling missing data" -- yes this is what the "PreferenceInferrer"
> does in the framework. I am not convinced it is very useful, but it lets
> you synthesize missing data.
>
> I also suspect 1 is the way to go. It is definitely the conventional
> approach and the one already implemented by the framework.

For your comments, I should say it is very application-dependent. I used to
think the same as you, but after some experiments, I saw that they all give
different, interesting results depending on the data set.
Thanks for sharing your ideas.

--
Gökhan Çapan
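For reference, a small self-contained sketch of the 1st variant discussed
above (IUF-weighted cosine over sparse item vectors). The class and method
names are made up for illustration, and uf is read here as the per-user
frequency (how many items that user has interacted with); this is not
Mahout code:

import java.util.Map;

// Item vectors are keyed by user ID, each component is scaled by
// iuf(user) = log(N / (1 + uf)), and the score is the cosine of the two
// weighted vectors.
final class IufCosine {

  // iuf = log(N / (1 + uf)); N = total number of users
  static double iuf(int totalUsers, int userFrequency) {
    return Math.log((double) totalUsers / (1 + userFrequency));
  }

  // Cosine similarity of two sparse item vectors, weighted per user by IUF.
  // Returning 'dot' without the final normalization gives the plain
  // dot-product scores of the 2nd and 3rd variants.
  static double similarity(Map<Long, Double> item1,
                           Map<Long, Double> item2,
                           Map<Long, Double> iufByUser) {
    double dot = 0.0;
    double norm1 = 0.0;
    double norm2 = 0.0;
    for (Map.Entry<Long, Double> e : item1.entrySet()) {
      double v1 = iufByUser.getOrDefault(e.getKey(), 0.0) * e.getValue();
      norm1 += v1 * v1;
      Double raw2 = item2.get(e.getKey());
      if (raw2 != null) {
        // only users present in both vectors contribute to the dot product
        dot += v1 * iufByUser.getOrDefault(e.getKey(), 0.0) * raw2;
      }
    }
    for (Map.Entry<Long, Double> e : item2.entrySet()) {
      double v2 = iufByUser.getOrDefault(e.getKey(), 0.0) * e.getValue();
      norm2 += v2 * v2;
    }
    if (norm1 == 0.0 || norm2 == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
  }
}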