On Wed, Oct 28, 2009 at 3:06 PM, Gökhan Çapan <gkhn...@gmail.com> wrote:
> similarities at once. For example, loglikelihood similarity does something
> with 4 quantities of two items:
> item1item2 : # of users who rated both item1 and item2
> !item1item2 : # of users who rated item2 but not item1
> item1!item2 : # of users who rated item1 but not item2
> !item1!item2 : # of users who didn't rate item1 or item2
> It is easy to find these quantities with mapreduce. When I finish
> implementing it, I will share it with a patch.
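(For concreteness, here is a rough sketch of how a log-likelihood ratio can
be computed from those four counts, using the usual entropy / G-test
formulation. This is only an illustration, not the framework's actual code;
the class and method names are made up for the example.)

// Rough sketch only, not the framework's implementation. Computes the
// log-likelihood ratio from the four co-occurrence counts described above:
// k11 = # rated both, k12 = # rated item2 only, k21 = # rated item1 only,
// k22 = # rated neither.
public final class LLRSketch {

  private static double xLogX(long x) {
    return x == 0L ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy of a set of counts: xLogX(sum) - sum of xLogX(count).
  private static double entropy(long... counts) {
    long sum = 0L;
    double sumXLogX = 0.0;
    for (long count : counts) {
      sum += count;
      sumXLogX += xLogX(count);
    }
    return xLogX(sum) - sumXLogX;
  }

  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // 2 * (H(rows) + H(columns) - H(matrix)) is the G-test statistic.
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }
}

The four counts would come straight out of the mapreduce job; the ratio can
then be mapped onto whatever similarity scale the implementation expects.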
Yes, you can easily compute these offline and then use them in an
alternate implementation of LogLikelihoodSimilarity to produce fast
results. The only possible issue I see is loading all these values into
memory, since their number grows as the square of the number of items.
You could store only the 'most popular' pairs in memory and fall back to
a normal computation when a pair hasn't been precomputed. This amounts to
caching, and might be done just as easily by writing a caching wrapper
around LogLikelihoodSimilarity that uses the "Cache" class, which will do
a nice job of limiting memory usage and removing unpopular entries.

> There is also another measure that we talked about in our discussion.
> Ted said the dot product can yield very good results, and it can be
> weighted with inverse user frequency.
> I've implemented 3 versions of that approach to see how good the
> results are:

Agreed, this is basically what PearsonCorrelationSimilarity does --
really it's also an implementation of the cosine-measure similarity,
since the data is 'centered' to a mean of zero. And I believe this is
what Ted is referring to by dot product. The framework also already has a
transformation for inverse user frequency. But yes, I think your point is
doing this at scale, all at once, with a big matrix operation, via
mapreduce, for all pairs of items. I agree this is a useful thing. The
results, when precomputed this way, make the online similarity
computation very fast. You could feed them into any item-based
recommender that way, or you could rewrite the rest of the recommender
process in mapreduce and continue that way.

> First, I have worked on a one-month user log of an e-commerce site. The
> data set is very sparse, but there are some items that are very popular
> and don't fit the general characteristics of the data.
>
> I've tried 3 versions of that computation:
> 1- iuf = log(N/(1+uf)) ; N is the total number of users here

Yep.

> score(item1,item2) = cosine similarity of the item1 and item2 vectors
> whose elements are weighted with the iuf described above; the dot
> product of the two items is normalized by the Euclidean lengths of the
> items (i.e. score = dotproduct / (length(item1) * length(item2)))

Yes, this is the cosine-measure similarity, which also reduces to Pearson
when the data has mean zero.

> 2- iuf is the same as in 1, but the score is just the dot product

I think it's problematic to not normalize the dot product. I wouldn't
expect this to work well, but let's see your results below --

> 3- iuf = N/(1+uf), and the score is just the dot product

Valid, but I do think the log() is useful to keep here.

> As a result, I think it would be good if we found a way to smooth the
> item vectors to make the 1st way yield better results, for example by
> filling in some missing data. And it would be good if we had a way to
> merge the 2nd and 3rd ways -- that is, another similarity measure that
> finds hidden (hard-to-find) similar items like the 2nd way, and gives
> results like the 3rd way.

"Filling missing data" -- yes, this is what the "PreferenceInferrer" does
in the framework. I am not convinced it is very useful, but it lets you
synthesize missing data. I also suspect 1 is the way to go; it is
definitely the conventional approach and the one already implemented by
the framework.
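For illustration, here is a rough sketch of that 1st variant -- cosine
similarity over iuf-weighted item vectors, with iuf = log(N/(1+uf)). The
item vectors here are just maps from user ID to rating; the class, method,
and parameter names are made up for the example, not anything in the
framework.

// Rough sketch only: variant 1 -- iuf-weighted cosine similarity between
// two item vectors. Each vector maps user ID -> rating; userFrequency maps
// user ID -> number of items that user has rated; numUsers is N.
import java.util.Map;

public final class IufCosineSketch {

  // iuf = log(N / (1 + uf)), as in variant 1 above.
  private static double iuf(long numUsers, long userFrequency) {
    return Math.log((double) numUsers / (1.0 + userFrequency));
  }

  // score = dotproduct / (length(item1) * length(item2)) over the
  // iuf-weighted vectors.
  public static double similarity(Map<Long, Double> item1,
                                  Map<Long, Double> item2,
                                  Map<Long, Long> userFrequency,
                                  long numUsers) {
    double dot = 0.0;
    double normSq1 = 0.0;
    double normSq2 = 0.0;
    for (Map.Entry<Long, Double> e : item1.entrySet()) {
      double weight = iuf(numUsers, userFrequency.get(e.getKey()));
      double v1 = weight * e.getValue();
      normSq1 += v1 * v1;
      Double rating2 = item2.get(e.getKey());
      if (rating2 != null) {
        // Only users who rated both items contribute to the dot product.
        dot += v1 * weight * rating2;
      }
    }
    for (Map.Entry<Long, Double> e : item2.entrySet()) {
      double v2 = iuf(numUsers, userFrequency.get(e.getKey())) * e.getValue();
      normSq2 += v2 * v2;
    }
    if (normSq1 == 0.0 || normSq2 == 0.0) {
      return 0.0; // an empty or all-zero item vector has no defined cosine
    }
    return dot / Math.sqrt(normSq1 * normSq2);
  }
}

In a mapreduce version you'd presumably emit per-pair partial dot products
and per-item norms, and combine them in the reducers rather than holding
whole vectors like this.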