Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-30 Thread Sebastian Schelter
Hi Sean,

Then I'll skip benchmarking the patch you sent; I'm glad you like the
original version of the code.

Sebastian

Btw: our discussion here also helps me fill the pages of my diploma
thesis with interesting stuff ;)



Sean Owen schrieb:
 Nah, scratch that too. The simple version of this idea doesn't scale,
 and I was unable to get the current version to run at any significantly
 different speed. It's just good as-is.

 Now there is a non-distributed similarity implementation that matches
 what this does, which was the original question.

 Sean



Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-28 Thread Sebastian Schelter
Hi Sean and Jeff,

I looked at the formulas and I see your point that the computation is
the same for input series with a mean of zero, thank you for the
detailed feedback on this.

However, I'm a little bit confused now; let me explain why I thought
this additional similarity implementation would be necessary: Three
weeks ago I submitted a patch that computes the cosine item similarities
via map-reduce (MAHOUT-362), which is how I currently fill my database
table of precomputed item-item similarities. Some days ago I needed to
precompute lots of recommendations offline via
org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want
to connect my map-reduce code to the database. So I needed a
similarity implementation that would give me the same results on the fly
from a FileDataModel, which is how I came to create the
CosineItemSimilarity implementation. My point would be that if the code
in org.apache.mahout.cf.taste.hadoop.similarity.item does not center the
data, then either a non-distributed implementation of that way of
computing the similarity should be available too, or the code in
org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to
center the data.

You stated that centering the data adjusts for a user's tendency to
rate high or low on average, which is certainly necessary when you
collect explicit ratings from your users. However, in my use case (a big
online-shopping platform) I unfortunately do not have explicit ratings
from the users, so I chose to interpret certain actions as ratings (I
recall this is called implicit ratings in the literature), e.g. a user
putting an item into their shopping basket counts as a rating of 3 and a
user purchasing an item as a rating of 5, as suggested in the Making
Recommendations chapter of Programming Collective Intelligence by Toby
Segaran. As far as I can judge (to be honest, my mathematical knowledge
is kinda limited) there are no different interpretations of the rating
scale here, as the values are fixed, so I thought that centering the
data would not be necessary.
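
For illustration only, such a fixed action-to-rating mapping could look like
the hypothetical sketch below (plain Python, not Mahout code; the
"strongest action wins" rule and the event names are my assumptions, not
details from this thread):

```python
# Hypothetical mapping from implicit shop events to fixed rating values,
# following the basket -> 3, purchase -> 5 scheme described above.
EVENT_RATING = {"basket": 3.0, "purchase": 5.0}

def implicit_rating(events):
    # Assumption: the strongest observed action wins; None if nothing known.
    known = [EVENT_RATING[e] for e in events if e in EVENT_RATING]
    return max(known) if known else None

assert implicit_rating(["basket"]) == 3.0
assert implicit_rating(["basket", "purchase"]) == 5.0
assert implicit_rating(["view"]) is None
```

Because every user draws ratings from the same fixed set of values, there is
no per-user rating bias for centering to remove, which is the argument made
above.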

Regards,
Sebastian




Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-28 Thread Sean Owen
Well, it's not hard to add something like UncenteredCosineSimilarity;
I don't mind. It's actually just a matter of configuring the
superclass to center or not.

But it's also easy to center the data in the M/R. I agree it makes
little difference in your case, and the effect is subtle. I can add
centering, to at least have it in the code for consistency -- in
addition to adding the implementation above for completeness.

While I'm at it I think we might be able to simplify this item-item
computation. A straightforward alternative is something like this:

1. Compute item vectors (using Vector output, ideally) in one M/R
2M. For each item-item pair output both vectors
2R. With both Vectors in hand easily compute the cosine measure

It's quite simple, and lends itself to dropping in a different reducer
stage to implement different similarities, which is great.
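
A single-machine sketch of the steps above, with plain Python standing in
for the M/R stages (step 1 builds the item vectors; 2M/2R are the mapper
and reducer of a second job). Toy data, not the actual Mahout job:

```python
import math
from collections import defaultdict
from itertools import combinations

# Toy preference triples standing in for the input: (user, item, rating).
prefs = [("u1", "A", 5.0), ("u1", "B", 3.0),
         ("u2", "A", 4.0), ("u2", "B", 2.0), ("u2", "C", 5.0)]

# Step 1: one sparse vector per item (user -> rating), as the first job emits.
item_vectors = defaultdict(dict)
for user, item, rating in prefs:
    item_vectors[item][user] = rating

def cosine(v, w):
    # Cosine measure over two sparse user -> rating maps.
    dot = sum(r * w[u] for u, r in v.items() if u in w)
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    norm_w = math.sqrt(sum(r * r for r in w.values()))
    return dot / (norm_v * norm_w)

# Steps 2M/2R: emit both vectors for every item pair, then reduce each pair
# to its cosine similarity.
similarities = {
    pair: cosine(item_vectors[pair[0]], item_vectors[pair[1]])
    for pair in combinations(sorted(item_vectors), 2)
}
```

Swapping `cosine` for another function of two vectors is exactly the
"different reducer stage" flexibility mentioned above.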

The question is performance. Say I have P prefs, U users and I items.
Assume P/I prefs per item. The bottleneck of this stage is outputting
I^2 vectors of size P/I, or about P*I output.

What you're doing now bottlenecks in the last bit, where you output
some data for each user for every pair of its items. That's coarsely
U * (P/U)^2 = P^2/U output. I think that P^2 factor is the killer.
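
Plugging hypothetical sizes into those two estimates makes the comparison
concrete (made-up numbers, and assuming prefs are spread uniformly; note
the later messages in this thread conclude the pairwise-item approach
doesn't scale):

```python
# Back-of-the-envelope comparison of the two bottlenecks, with made-up sizes.
P = 10_000_000   # preferences
U = 1_000_000    # users
I = 100_000      # items

pairwise_items = P * I       # proposed job: I^2 vectors of size P/I each
pairwise_users = P * P // U  # current job: U * (P/U)^2 = P^2/U, if uniform

assert pairwise_items == 10**12
assert pairwise_users == 10**8
```

Uniformity flatters the per-user estimate: a few users with very many prefs
blow up the quadratic per-user term, which is the practical worry raised
elsewhere in this thread.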

Double-check my thinking but if you like it I think I can
significantly simplify this job. I can maybe make a patch for you to
try.

On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter
sebastian.schel...@zalando.de wrote:
 However, I'm a little bit confused now, let me explain why I thought
 that this additional similarity implementation would be necessary: Three
 weeks ago, I submitted a patch computing the cosine item similarities
 via map-reduce (MAHOUT-362), which is how I currently fill my database
 table of precomputed item-item-similarities. Some days ago I needed to
 precompute lots of recommendations offline via
 org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want
 to connect my map-reduce code to the database. So I was in need of a
 similarity implementation that would give me the same results on the fly
 from a FileDataModel, which is how I came to create the
 CosineItemSimilarity implementation. So my point here would be that if
 the code in org.apache.mahout.cf.taste.hadoop.similarity.item does not
 center the data, a non-distributed implementation of that way of
 computing the similarity should be available too, or the code in
 org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to
 center the data.


Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-28 Thread Sean Owen
Nah, scratch that too. The simple version of this idea doesn't scale,
and I was unable to get the current version to run at any significantly
different speed. It's just good as-is.

Now there is a non-distributed similarity implementation that matches
what this does, which was the original question.

Sean

On Wed, Apr 28, 2010 at 7:14 PM, Sean Owen sro...@gmail.com wrote:
 Actually scratch that patch I sent over. I see the trick now that
 makes the existing approach quite good. I think I can make a version
 that preserves that trick and still streamlines the processing. I will
 benchmark and report back if successful.

 On Wed, Apr 28, 2010 at 3:20 PM, Sean Owen sro...@gmail.com wrote:
 Sorry, typo, that's what I meant. Yes, the difference isn't *that* large!
 It may be worse in practice since you have a few users with very many prefs.
 It may also be beneficial to simply have one fewer phase and throw
 around less data. I will also try to benchmark since really that's the
 only way to know.




[jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-27 Thread Sean Owen (JIRA)

     [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-387:
-----------------------------

        Status: Resolved  (was: Patch Available)
      Assignee: Sean Owen
 Fix Version/s: 0.3
    Resolution: Won't Fix

Yes, like Jeff said, this actually exists as PearsonCorrelationSimilarity. In
the case where the mean of each series is 0, the result is the same. The
fastest way I know to see this is to look at this form of the sample
correlation:
http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png
... and note that sum(xi) = sum(yi) = 0 when the means of xi and yi are 0.
You're left with sum(xi*yi) in the numerator, which is the dot product, and
sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which is the product of
the vector magnitudes. This is just the cosine of the angle between x and y.
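
The identity is easy to check numerically. A minimal sketch in plain Python
(not Mahout code; `cosine` and `pearson` are ad-hoc helpers for
illustration):

```python
import math

def cosine(x, y):
    # dot(x, y) / (|x| * |y|): the cosine of the angle between x and y.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def pearson(x, y):
    # Sample correlation, computed as the cosine of the mean-centered series.
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

x = [5.0, 3.0, 4.0]
y = [4.0, 2.0, 3.0]

# Plain cosine and Pearson differ on uncentered data...
assert abs(cosine(x, y) - pearson(x, y)) > 1e-3
# ...but on zero-mean series the two coincide, as argued above.
z1 = [1.0, -1.0, 0.0]
z2 = [2.0, -2.0, 0.0]
assert abs(cosine(z1, z2) - pearson(z1, z2)) < 1e-12
```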

One can argue whether forcing the data to be centered is right. I think it's a 
good thing in all cases. It adjusts for a user's tendency to rate high or low 
on average. It also makes the computation simpler, and more consistent with 
Pearson (well, it makes it identical!). This has a good treatment:
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation

For that reason only, I'd mark this as won't-fix for the moment; the patch is
otherwise nice. I'd personally like to hear more about why not to center, if
there's an argument for it.

 Cosine item similarity implementation
 --------------------------------------

                 Key: MAHOUT-387
                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
             Project: Mahout
          Issue Type: New Feature
          Components: Collaborative Filtering
            Reporter: Sebastian Schelter
            Assignee: Sean Owen
             Fix For: 0.3

         Attachments: MAHOUT-387.patch


 I needed to compute the cosine similarity between two items when running
 org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, but I couldn't find
 an implementation (did I maybe overlook it?), so I created my own. I want to
 share it here in case you find it useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.