Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation
Hi Sean,

Then I'll be spared benchmarking the patch you sent; I'm glad you like the original version of the code.

Sebastian

Btw: our discussion here also helps me fill the pages of my diploma thesis with interesting material ;)

Sean Owen schrieb:
> Nah, scratch that too. The simple version of this idea doesn't scale, and I was unable to get the current version to run at a significantly different speed. It's just good as-is. Now there is a non-distributed similarity implementation that matches what this does, which was the original question.
>
> Sean
Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation
Hi Sean and Jeff,

I looked at the formulas and I see your point that the computation is the same for input series with a mean of zero; thank you for the detailed feedback on this. However, I'm a little bit confused now, so let me explain why I thought this additional similarity implementation would be necessary:

Three weeks ago, I submitted a patch computing the cosine item similarities via map-reduce (MAHOUT-362), which is how I currently fill my database table of precomputed item-item similarities. Some days ago I needed to precompute lots of recommendations offline via org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want to connect my map-reduce code to the database. So I needed a similarity implementation that would give me the same results on the fly from a FileDataModel, which is how I came to create the CosineItemSimilarity implementation.

So my point here would be: if the code in org.apache.mahout.cf.taste.hadoop.similarity.item does not center the data, then either a non-distributed implementation of that way of computing the similarity should be available too, or the code in org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to center the data.

You stated that centering the data adjusts for a user's tendency to rate high or low on average, which is certainly necessary when you collect explicit ratings from your users. However, in my use case (a big online-shopping platform) I unfortunately do not have explicit ratings from the users, so I chose to interpret certain actions as ratings (I recall this is called implicit ratings in the literature), e.g. a user putting an item into their shopping basket as a rating of 3, or a user purchasing an item as a rating of 5, as suggested in the "Making Recommendations" chapter of Programming Collective Intelligence by T. Segaran.
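A minimal sketch of such an action-to-rating mapping (the event names and class below are hypothetical illustrations, not Mahout code; the 3/5 values follow the basket/purchase scheme just described):

```java
import java.util.Map;

public class ImplicitRatings {

    // Hypothetical mapping of shop events to fixed preference values,
    // in the spirit of the basket=3 / purchase=5 scheme.
    private static final Map<String, Double> EVENT_RATING = Map.of(
            "ADD_TO_BASKET", 3.0,
            "PURCHASE", 5.0);

    /** Returns the implicit rating for an event, or 0.0 if the event is untracked. */
    public static double rate(String event) {
        return EVENT_RATING.getOrDefault(event, 0.0);
    }

    public static void main(String[] args) {
        System.out.println(rate("ADD_TO_BASKET")); // 3.0
        System.out.println(rate("PURCHASE"));      // 5.0
    }
}
```

Because every user's possible values come from the same fixed set, there is no per-user rating scale to normalize away, which is the intuition behind skipping the centering step.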
As far as I can judge (to be honest, my mathematical knowledge is somewhat limited), there are no different interpretations of the rating scale here, since the values are fixed, so I thought that centering the data would not be necessary.

Regards,
Sebastian

Sean Owen (JIRA) schrieb:
> [ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> Sean Owen updated MAHOUT-387: Status: Resolved (was: Patch Available), Resolution: Won't Fix
>
> Yes, like Jeff said, this actually exists as PearsonCorrelationSimilarity. In the case where the mean of each series is 0, the result is the same.
Cosine item similarity implementation

Key: MAHOUT-387
URL: https://issues.apache.org/jira/browse/MAHOUT-387
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Sebastian Schelter
Assignee: Sean Owen
Fix For: 0.3
Attachments: MAHOUT-387.patch

I needed to compute the cosine similarity between two items when running org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob. I couldn't find an existing implementation (did I maybe overlook it?), so I created my own. I want to share it here, in case you find it useful.
Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation
Well, it's not hard to add something like UncenteredCosineSimilarity, for sure; I don't mind. It's actually a matter of configuring the superclass to center or not. But it's also easy to center the data in the M/R. I agree it makes little difference in your case, and the effect is subtle. I can add centering, to at least have it in the code for consistency, in addition to adding the implementation above for completeness.

While I'm at it, I think we might be able to simplify this item-item computation. A straightforward alternative is something like this:

1. Compute item vectors (using Vector output, ideally) in one M/R
2M. For each item-item pair, output both vectors
2R. With both Vectors in hand, easily compute the cosine measure

It's quite simple, and it lends itself to dropping in a different reducer stage to implement different similarities, which is great. The question is performance. Say I have P prefs, U users and I items. Assume P/I prefs per item. The bottleneck of this stage is outputting I^2 vectors of size P/I, or about P*I output. What you're doing now bottlenecks in the last bit, where you output, for each user, some data for every pair of items. That's coarsely U * (P/U)^2 = P^2/U output. I think that P^2 factor is the killer. Double-check my thinking, but if you like it, I think I can significantly simplify this job. I can maybe make a patch for you to try.

On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter sebastian.schel...@zalando.de wrote:
> However, I'm a little bit confused now, let me explain why I thought that this additional similarity implementation would be necessary:
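A non-distributed sketch of the two-phase idea above, using plain Java maps as stand-ins for Mahout Vectors (class and variable names here are illustrative, not actual Mahout classes): phase 1 produces one sparse preference vector per item, and phase 2 computes the cosine for an item pair from the two vectors.

```java
import java.util.HashMap;
import java.util.Map;

public class ItemPairCosineSketch {

    /** Cosine of two sparse vectors keyed by user index. Assumes non-empty vectors. */
    public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // overlapping users contribute to the dot product
            }
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            nb += v * v;
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Phase 1 stand-in: one sparse preference vector per item
        // (in the real job this would be the Vector output of the first M/R).
        Map<String, Map<Integer, Double>> itemVectors = new HashMap<>();
        itemVectors.put("itemA", Map.of(1, 5.0, 2, 3.0));
        itemVectors.put("itemB", Map.of(1, 4.0, 2, 2.0));

        // Phase 2 stand-in: with both vectors in hand, the reducer's work
        // is one dot product and two norms per item pair.
        System.out.println(cosine(itemVectors.get("itemA"), itemVectors.get("itemB")));
    }
}
```

The cost analysis above applies to the emitted pairs: every one of the roughly I^2 pairs carries two vectors of about P/I entries each, which is where the P*I output figure comes from.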
Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation
Nah, scratch that too. The simple version of this idea doesn't scale, and I was unable to get the current version to run at a significantly different speed. It's just good as-is. Now there is a non-distributed similarity implementation that matches what this does, which was the original question.

Sean

On Wed, Apr 28, 2010 at 7:14 PM, Sean Owen sro...@gmail.com wrote:
> Actually, scratch that patch I sent over. I now see the trick that makes the existing approach quite good. I think I can make a version that preserves that trick and still streamlines the processing. I will benchmark and report back if successful.
>
> On Wed, Apr 28, 2010 at 3:20 PM, Sean Owen sro...@gmail.com wrote:
>> Sorry, typo; that's what I meant. Yes, the difference isn't *that* large! It may be worse in practice, since you have a few users with very many prefs. It may also be beneficial to simply have one fewer phase and throw around less data. I will also try to benchmark, since really that's the only way to know.
[jira] Updated: (MAHOUT-387) Cosine item similarity implementation
[ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-387:

Status: Resolved (was: Patch Available)
Assignee: Sean Owen
Fix Version/s: 0.3
Resolution: Won't Fix

Yes, like Jeff said, this actually exists as PearsonCorrelationSimilarity. In the case where the mean of each series is 0, the result is the same. The fastest way I know to see this is to look at this form of the sample correlation:

http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png

... and note that sum(xi) = sum(yi) = 0 when the means of xi and yi are 0. You're left with sum(xi*yi) in the numerator, which is the dot product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which is the product of the vector norms. This is just the cosine of the angle between x and y.

One can argue whether forcing the data to be centered is right. I think it's a good thing in all cases. It adjusts for a user's tendency to rate high or low on average. It also makes the computation simpler and more consistent with Pearson (well, it makes it identical!). This has a good treatment:

http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation

For this reason I'd mark this as won't-fix for the moment; the patch is otherwise nice. I'd personally like to hear more about why not to center, if there's an argument for it.
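The identity described above can be checked numerically. The sketch below (illustrative code, not part of any Mahout patch) computes the plain cosine measure and a Pearson correlation written as cosine-after-centering; applying the cosine to centered series gives exactly the Pearson value.

```java
public class CenteredCosineDemo {

    /** Plain cosine similarity: dot(x, y) / (|x| * |y|). */
    public static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    /** Subtracts the series mean, so the centered series sums to 0. */
    public static double[] center(double[] v) {
        double mean = 0;
        for (double d : v) {
            mean += d;
        }
        mean /= v.length;
        double[] c = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            c[i] = v[i] - mean;
        }
        return c;
    }

    /** Pearson correlation, written as the cosine of the centered series. */
    public static double pearson(double[] x, double[] y) {
        return cosine(center(x), center(y));
    }

    public static void main(String[] args) {
        double[] x = {5, 3, 4, 4};
        double[] y = {3, 1, 2, 3};
        // cosine(x, y) and pearson(x, y) differ on the raw data,
        // but cosine on the centered data reproduces pearson exactly.
        System.out.println(cosine(x, y));
        System.out.println(pearson(x, y));
    }
}
```

Note that y here is x shifted down by 2 per entry; centering removes exactly that kind of per-user offset, which is the "tendency to rate high or low" being adjusted for.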
--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.