Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation
Hi Sean,

Then I'll be spared benchmarking the patch you sent; I'm glad you like the original version of the code.

Sebastian

Btw: our discussion here also helps me fill the pages of my diploma thesis with interesting material ;)

Sean Owen schrieb:
> Nah, scratch that too. The simple version of this idea doesn't scale, and I was unable to get the current version to run at a significantly different speed. It's just good as-is. Now there is a non-distributed similarity implementation that matches what this does, which was the original question.
>
> Sean
Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation
Hi Sean and Jeff,

I looked at the formulas and I see your point that the computation is the same for input series with a mean of zero; thank you for the detailed feedback on this. However, I'm a little bit confused now, so let me explain why I thought this additional similarity implementation would be necessary:

Three weeks ago, I submitted a patch computing the cosine item similarities via map-reduce (MAHOUT-362), which is how I currently fill my database table of precomputed item-item similarities. Some days ago I needed to precompute lots of recommendations offline via org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want to connect my map-reduce code to the database. So I needed a similarity implementation that would give me the same results on the fly from a FileDataModel, which is how I came to create the CosineItemSimilarity implementation.

So my point here would be: if the code in org.apache.mahout.cf.taste.hadoop.similarity.item does not center the data, then either a non-distributed implementation of that way of computing the similarity should be available too, or the code in org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to center the data.

You stated that centering the data adjusts for a user's tendency to rate high or low on average, which is certainly necessary when you collect explicit ratings from your users. However, in my use case (a big online-shopping platform) I unfortunately do not have explicit ratings from the users, so I chose to interpret certain actions as ratings (I recall this is called implicit ratings in the literature), e.g. a user putting an item into their shopping basket as a rating of 3, or a user purchasing an item as a rating of 5, as suggested in the "Making Recommendations" chapter of Programming Collective Intelligence by T. Segaran.
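A minimal sketch of such an action-to-rating mapping (the event names and class below are hypothetical illustrations, not Mahout code; the 3/5 values follow the basket/purchase scheme just described):

```java
import java.util.Map;

public class ImplicitRatings {

    // Hypothetical mapping of shop events to fixed preference values,
    // in the spirit of the basket=3 / purchase=5 scheme.
    private static final Map<String, Double> EVENT_RATING = Map.of(
            "ADD_TO_BASKET", 3.0,
            "PURCHASE", 5.0);

    /** Returns the implicit rating for an event, or 0.0 if the event is untracked. */
    public static double rate(String event) {
        return EVENT_RATING.getOrDefault(event, 0.0);
    }

    public static void main(String[] args) {
        System.out.println(rate("ADD_TO_BASKET")); // 3.0
        System.out.println(rate("PURCHASE"));      // 5.0
    }
}
```

Because every user's possible values come from the same fixed set, there is no per-user rating scale to normalize away, which is the intuition behind skipping the centering step.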
As far as I can judge (to be honest, my mathematical knowledge is somewhat limited), there are no different interpretations of the rating scale here, since the values are fixed, so I thought that centering the data would not be necessary.

Regards,
Sebastian

Sean Owen (JIRA) schrieb:
> [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> Sean Owen updated MAHOUT-387: Status: Resolved (was: Patch Available), Resolution: Won't Fix
>
> Yes, like Jeff said, this actually exists as PearsonCorrelationSimilarity. In the case where the mean of each series is 0, the result is the same.
Cosine item similarity implementation

Key: MAHOUT-387
URL: https://issues.apache.org/jira/browse/MAHOUT-387
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Sebastian Schelter
Assignee: Sean Owen
Fix For: 0.3
Attachments: MAHOUT-387.patch

I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob. I couldn't find an existing implementation (did I maybe overlook it?), so I created my own. I want to share it here, in case you find it useful.
Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation
Well, it's not hard to add something like UncenteredCosineSimilarity, for sure; I don't mind. It's actually a matter of configuring the superclass to center or not. But it's also easy to center the data in the M/R. I agree it makes little difference in your case, and the effect is subtle. I can add centering, to at least have it in the code for consistency, in addition to adding the implementation above for completeness.

While I'm at it, I think we might be able to simplify this item-item computation. A straightforward alternative is something like this:

1. Compute item vectors (using Vector output, ideally) in one M/R
2M. For each item-item pair, output both vectors
2R. With both Vectors in hand, easily compute the cosine measure

It's quite simple, and it lends itself to dropping in a different reducer stage to implement different similarities, which is great. The question is performance. Say I have P prefs, U users and I items. Assume P/I prefs per item. The bottleneck of this stage is outputting I^2 vectors of size P/I, or about P*I output. What you're doing now bottlenecks in the last bit, where you output, for each user, some data for every pair of items. That's coarsely U * (P/U)^2 = P^2/U output. I think that P^2 factor is the killer. Double-check my thinking, but if you like it, I think I can significantly simplify this job. I can maybe make a patch for you to try.

On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter sebastian.schel...@zalando.de wrote:
> However, I'm a little bit confused now, let me explain why I thought that this additional similarity implementation would be necessary:
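A non-distributed sketch of the two-phase idea above, using plain Java maps as stand-ins for Mahout Vectors (class and variable names here are illustrative, not actual Mahout classes): phase 1 produces one sparse preference vector per item, and phase 2 computes the cosine for an item pair from the two vectors.

```java
import java.util.HashMap;
import java.util.Map;

public class ItemPairCosineSketch {

    /** Cosine of two sparse vectors keyed by user index. Assumes non-empty vectors. */
    public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // overlapping users contribute to the dot product
            }
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            nb += v * v;
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Phase 1 stand-in: one sparse preference vector per item
        // (in the real job this would be the Vector output of the first M/R).
        Map<String, Map<Integer, Double>> itemVectors = new HashMap<>();
        itemVectors.put("itemA", Map.of(1, 5.0, 2, 3.0));
        itemVectors.put("itemB", Map.of(1, 4.0, 2, 2.0));

        // Phase 2 stand-in: with both vectors in hand, the reducer's work
        // is one dot product and two norms per item pair.
        System.out.println(cosine(itemVectors.get("itemA"), itemVectors.get("itemB")));
    }
}
```

The cost analysis above applies to the emitted pairs: every one of the roughly I^2 pairs carries two vectors of about P/I entries each, which is where the P*I output figure comes from.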
Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation
Nah, scratch that too. The simple version of this idea doesn't scale, and I was unable to get the current version to run at a significantly different speed. It's just good as-is. Now there is a non-distributed similarity implementation that matches what this does, which was the original question.

Sean

On Wed, Apr 28, 2010 at 7:14 PM, Sean Owen sro...@gmail.com wrote:
> Actually, scratch that patch I sent over. I now see the trick that makes the existing approach quite good. I think I can make a version that preserves that trick and still streamlines the processing. I will benchmark and report back if successful.
>
> On Wed, Apr 28, 2010 at 3:20 PM, Sean Owen sro...@gmail.com wrote:
>> Sorry, typo; that's what I meant. Yes, the difference isn't *that* large! It may be worse in practice, since you have a few users with very many prefs. It may also be beneficial to simply have one fewer phase and throw around less data. I will also try to benchmark, since really that's the only way to know.
[jira] Updated: (MAHOUT-387) Cosine item similarity implementation
[ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-387:

Status: Resolved (was: Patch Available)
Assignee: Sean Owen
Fix Version/s: 0.3
Resolution: Won't Fix

Yes, like Jeff said, this actually exists as PearsonCorrelationSimilarity. In the case where the mean of each series is 0, the result is the same. The fastest way I know to see this is to look at this form of the sample correlation:

http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png

... and note that sum(xi) = sum(yi) = 0 when the means of xi and yi are 0. You're left with sum(xi*yi) in the numerator, which is the dot product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which is the product of the vector norms. This is just the cosine of the angle between x and y.

One can argue whether forcing the data to be centered is right. I think it's a good thing in all cases. It adjusts for a user's tendency to rate high or low on average. It also makes the computation simpler and more consistent with Pearson (well, it makes it identical!). This has a good treatment:

http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation

For this reason I'd mark this as won't-fix for the moment; the patch is otherwise nice. I'd personally like to hear more about why not to center, if there's an argument for it.
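The identity described above can be checked numerically. The sketch below (illustrative code, not part of any Mahout patch) computes the plain cosine measure and a Pearson correlation written as cosine-after-centering; applying the cosine to centered series gives exactly the Pearson value.

```java
public class CenteredCosineDemo {

    /** Plain cosine similarity: dot(x, y) / (|x| * |y|). */
    public static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    /** Subtracts the series mean, so the centered series sums to 0. */
    public static double[] center(double[] v) {
        double mean = 0;
        for (double d : v) {
            mean += d;
        }
        mean /= v.length;
        double[] c = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            c[i] = v[i] - mean;
        }
        return c;
    }

    /** Pearson correlation, written as the cosine of the centered series. */
    public static double pearson(double[] x, double[] y) {
        return cosine(center(x), center(y));
    }

    public static void main(String[] args) {
        double[] x = {5, 3, 4, 4};
        double[] y = {3, 1, 2, 3};
        // cosine(x, y) and pearson(x, y) differ on the raw data,
        // but cosine on the centered data reproduces pearson exactly.
        System.out.println(cosine(x, y));
        System.out.println(pearson(x, y));
    }
}
```

Note that y here is x shifted down by 2 per entry; centering removes exactly that kind of per-user offset, which is the "tendency to rate high or low" being adjusted for.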
--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.