> Yes you can easily compute these offline, and then use them in an
> alternate implementation of LogLikelihoodSimilarity to produce fast
> results. The only possible issue I see is loading all these values
> into memory, since it grows as the square of the number of items.
>
> You could store only 'most-popular' pairs in memory and fall back to a
> normal computation when the pair hasn't been precomputed. This amounts
> to caching, and might be done as easily by simply writing a caching
> wrapper around LogLikelihoodSimilarity, which caches with the "Cache"
> class, which will do a nice job of limiting the memory usage and
> removing unpopular entries.
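For concreteness, here is a rough sketch of that caching-wrapper idea. The
PairwiseSimilarity interface and the LinkedHashMap-based LRU cache below are
illustrative stand-ins, not the actual Mahout Cache or ItemSimilarity classes:

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a pairwise item-item similarity (in Mahout this
// role is played by an ItemSimilarity such as LogLikelihoodSimilarity).
interface PairwiseSimilarity {
  double similarity(long itemID1, long itemID2);
}

// Wraps any similarity in a size-bounded, least-recently-used cache:
// popular pairs are answered from memory, everything else falls back to
// the normal computation.
final class CachingSimilarity implements PairwiseSimilarity {

  private final PairwiseSimilarity delegate;
  private final Map<Long, Double> cache;

  CachingSimilarity(PairwiseSimilarity delegate, final int maxEntries) {
    this.delegate = delegate;
    // access-ordered LinkedHashMap evicts the least-recently-used entry,
    // which bounds memory and drops unpopular pairs first
    this.cache = new LinkedHashMap<Long, Double>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Long, Double> eldest) {
        return size() > maxEntries;
      }
    };
  }

  @Override
  public double similarity(long itemID1, long itemID2) {
    // order-independent key; assumes item IDs fit in 32 bits
    long lo = Math.min(itemID1, itemID2);
    long hi = Math.max(itemID1, itemID2);
    Long key = (lo << 32) | (hi & 0xFFFFFFFFL);
    Double cached = cache.get(key);
    if (cached != null) {
      return cached;
    }
    double value = delegate.similarity(itemID1, itemID2);
    cache.put(key, value);
    return value;
  }
}

Wrapping the real log-likelihood computation this way keeps memory use bounded
no matter how many item pairs exist, at the cost of recomputing unpopular pairs.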
That sounds interesting.

> > There is also another measure that we talked about in our discussion. Ted
> > said dot product can yield very good results and it can be weighted with
> > inverse user frequency.
> > I've implemented 3 versions of that approach to see how good the results
> > are:
>
> Agree, this is what PearsonCorrelationSimilarity basically does --
> really it's also an implementation of the cosine-measure similarity,
> since the data is 'centered' to a mean of zero. And I believe this is
> what Ted is referring to by dot product. The framework also already
> has a transformation for inverse user frequency.
>
> But yeah I think your point is doing this at scale, all at once, with
> a big matrix operation, via mapreduce, for all pairs of items. I agree
> this is a useful thing. The results, when precomputed this way, make
> the on-line similarity computation very fast. You could feed it in to
> any item-based recommender that way.
>
> Or you could rewrite the rest of the recommender process in mapreduce,
> continue that way.
>
> > First, I have worked on a 1-month user log of an e-commerce site. The
> > data set is very sparse, but there are some items that are very popular
> > and don't fit the general characteristics of the data.
> >
> > I've tried 3 versions of that computation:
> > 1- iuf = log(N/(1+uf)); N is the total number of users here
>
> Yep
>
> > score(item1,item2) = cosine similarity of the item1 and item2 vectors
> > whose elements are weighted with the iuf described above; the dot product
> > of the two items is normalized by the Euclidean lengths of the items
> > (i.e. score = dotproduct/(length(item1)*length(item2)))
>
> PS yeah this is the cosine measure similarity, and also reduces to
> Pearson when the data has mean zero.
>
> > 2- iuf is the same as in 1, score is just the dot product
>
> I think it's problematic to not normalize the dot product. I wouldn't
> expect this to work well, but let's see your results below --
>
> > 3- iuf = N/(1+uf), and score is just the dot product
>
> Valid, but I do think the log() is useful to keep here.
>
> > As a result, I think it would be good if we found a way to smooth the
> > item vectors to make the 1st way yield better results, for example by
> > filling in some missing data. And it would be good if we had a way to
> > merge the 2nd and 3rd ways. I mean, another similarity measure that finds
> > hidden, hard-to-find similar items like 2, but also gives results like
> > the 3rd way.
>
> "Filling missing data" -- yes this is what the "PreferenceInferrer"
> does in the framework. I am not convinced it is very useful, but it lets
> you synthesize missing data.
>
> I also suspect 1 is the way to go. It is definitely the conventional
> approach and the one already implemented by the framework.

For your comments, I should say it is very application-dependent. I used to
think the same as you, but after some experiments, I saw that they all give
different, interesting results depending on the data set.
Thanks for sharing your ideas.

--
Gökhan Çapan
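For reference, a small self-contained sketch of the 1st variant discussed
above (IUF-weighted cosine over sparse item vectors). The class and method
names are made up for illustration, and uf is read here as the per-user
frequency (how many items that user has interacted with); this is not
Mahout code:

import java.util.Map;

// Item vectors are keyed by user ID, each component is scaled by
// iuf(user) = log(N / (1 + uf)), and the score is the cosine of the two
// weighted vectors.
final class IufCosine {

  // iuf = log(N / (1 + uf)); N = total number of users
  static double iuf(int totalUsers, int userFrequency) {
    return Math.log((double) totalUsers / (1 + userFrequency));
  }

  // Cosine similarity of two sparse item vectors, weighted per user by IUF.
  // Returning 'dot' without the final normalization gives the plain
  // dot-product scores of the 2nd and 3rd variants.
  static double similarity(Map<Long, Double> item1,
                           Map<Long, Double> item2,
                           Map<Long, Double> iufByUser) {
    double dot = 0.0;
    double norm1 = 0.0;
    double norm2 = 0.0;
    for (Map.Entry<Long, Double> e : item1.entrySet()) {
      double v1 = iufByUser.getOrDefault(e.getKey(), 0.0) * e.getValue();
      norm1 += v1 * v1;
      Double raw2 = item2.get(e.getKey());
      if (raw2 != null) {
        // only users present in both vectors contribute to the dot product
        dot += v1 * iufByUser.getOrDefault(e.getKey(), 0.0) * raw2;
      }
    }
    for (Map.Entry<Long, Double> e : item2.entrySet()) {
      double v2 = iufByUser.getOrDefault(e.getKey(), 0.0) * e.getValue();
      norm2 += v2 * v2;
    }
    if (norm1 == 0.0 || norm2 == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
  }
}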