[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794036#action_12794036 ] Ankur commented on MAHOUT-103: --

I skimmed through your version and what's present in the .item package. A few immediate things that come to mind are:
1. Moving to AbstractJob.
2. Refactoring to separate map, reduce and job classes. Personally I hate that, because the code base just bloats when the number of M/R jobs increases.

I have been trying to set up IDEA using the IntelliJ.codestyle.xml provided on the Mahout cwiki. I placed the file under idea_home/config/codeStyles and restarted IDEA, but it still does not show an import option in File > Settings > Code Style. IDEA shows the following messages in the background:

Field not copied JAVA_INDENT_OPTIONS
Field not copied JSP_INDENT_OPTIONS
Field not copied XML_INDENT_OPTIONS
Field not copied OTHER_INDENT_OPTIONS
Field not copied FIELD_TYPE_TO_NAME
Field not copied STATIC_FIELD_TYPE_TO_NAME
Field not copied PARAMETER_TYPE_TO_NAME
Field not copied LOCAL_VARIABLE_TYPE_TO_NAME
Field not copied PACKAGES_TO_USE_IMPORT_ON_DEMAND
Field not copied IMPORT_LAYOUT_TABLE

I am using v8.1.4. Once I set this up, I am going to look at your version and what's there in the .item package.

Co-occurence based nearest neighbourhood
Key: MAHOUT-103
URL: https://issues.apache.org/jira/browse/MAHOUT-103
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
Attachments: MAHOUT-103.patch, mahout-103.patch.v1, mahout-103.patch.v2, prepare.pl, run.sh

Nearest neighborhood type queries for users/items can be answered efficiently and effectively by analyzing the co-occurrence model of a user/item w.r.t another. This patch aims at providing an implementation for answering such queries based upon simple co-occurrence counts.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794064#action_12794064 ] Sean Owen commented on MAHOUT-103: -- I'm not worried about integrating with AbstractJob just this second, but at some point all our MR jobs ought to have some consistent approach. I also don't feel too strongly about Mappers/Reducers as inner classes. It's the same amount of code either way, so I don't think that's the difference, and I prefer the clarity of top-level classes in general unless a class is truly intricately associated with another class. But don't change that now. Am I understanding that you're roughly OK with this form of the patch, and I should submit, or..? Looking at .item can come later too. I'm mostly thinking of small/local changes at the moment.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12794133#action_12794133 ] Ankur commented on MAHOUT-103: --

Your changes don't look too drastic, and yes, roughly speaking I am OK. Since we are talking about committing this, I would like to say that I tested this for correctness on a very small hand-coded data-set and then ran it on the Netflix data. However, I couldn't verify its correctness on the Netflix data, though I am pretty confident it works correctly. That is why I was hoping to have a couple of unit tests:
1. To verify that similar items are identified correctly.
2. To verify that none of the seen items are recommended for a user.

But since this is not the final version, can you suggest any other approach to be 100% sure of correctness? I don't want something to be committed only to discover a silly issue later just because we didn't take extra care.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793617#action_12793617 ] Sean Owen commented on MAHOUT-103: -- My only significant concern is that this overlaps a lot with what I've already committed under .item. However, I'm not against committing this, as this part of the code is still in an experimental phase where we should have room to play and see what 'sticks'. There are, I think, a lot of small changes that need to be made here to fit code style, etc. I'm willing to commit after those sorts of things are addressed, and also volunteer to do them. If you're reasonably willing to evolve and integrate this code along with the surrounding code, it's fine by me to get this in and continue working that way.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793628#action_12793628 ] Ankur commented on MAHOUT-103: -- Evolving the code to integrate better with the existing stuff is fine with me. I am in general OK with throwing away code if it can be replaced by existing stuff that is better. However, I don't think it's a good idea to try to come up with a unified approach to generating Hadoop-based recommendations. I am afraid we'll create more problems than we'd solve. I see recommendations in the Hadoop world as the following linear chain of M/R jobs: Data Formatting -> Data Filter -> Core Recommender -> Post Processor. The last 2 jobs can themselves consist of 1 or more M/R jobs. There is I think . Let me come up with the unit tests and code documentation. After that you can start making the changes. Thanks a lot for the help. Appreciate that :-) BTW, did you have a chance to actually run it on the Netflix data?
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793631#action_12793631 ] Sean Owen commented on MAHOUT-103: -- Yep, fine by me. I think it would be far easier to unit test / document the 'final' version rather than something that will change notably. Really, I'm talking about formatting, style, use of libraries, etc. How about I post my version of the patch now? I have not run it. I trust you're on top of that.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793676#action_12793676 ] Jake Mannix commented on MAHOUT-103: I've actually got a bunch of variations of this in the .item package as well, but I haven't got them all fully working yet. I'm hoping they're a bit faster than what's in there now.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791971#action_12791971 ] Ankur commented on MAHOUT-103: --

OK, so here is the next version, which I again re-wrote completely :-( for performance reasons. The version now computes item similarity and uses that to generate recommendations in true Hadoop fashion. In a nutshell, the recommendations are generated in 2 steps:
1. Join the item-similarity data (generated by analyzing co-occurrence) with the user-click data.
2. Group the output of step 1 on the user key, so that a reducer receives all potential candidates for a user and also all items already clicked/seen by him, so that they can be excluded from the final recommendation set.

Also attached:
1. A Perl script to convert the Netflix data into the required format (userId \t movieId).
2. The Bash script used to run it on a 50-node Hadoop cluster. The recommendations are generated for all the users in less than 45 min.
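The two steps described in this comment can be sketched in memory as follows. This is only an illustration of the join-then-group idea, not the code in the attached patch: the function name `recommend`, the tuple layouts, the `top_n` parameter, and the choice to sum scores per candidate are all assumptions made for the example.

```python
from collections import defaultdict


def recommend(similar_items, clicks, top_n=10):
    """Hypothetical in-memory version of the two-step join.

    similar_items: iterable of (item, similar_item, score) triples
    clicks:        iterable of (user, item) pairs
    """
    # Step 1: join item-similarity data with user-click data on the item key.
    sims_by_item = defaultdict(list)
    for item, other, score in similar_items:
        sims_by_item[item].append((other, score))

    joined = []  # (user, candidate, score)
    for user, item in clicks:
        for other, score in sims_by_item[item]:
            joined.append((user, other, score))

    # Step 2: group by user; collect seen items so they can be excluded.
    seen = defaultdict(set)
    for user, item in clicks:
        seen[user].add(item)

    candidates = defaultdict(lambda: defaultdict(int))
    for user, candidate, score in joined:
        if candidate not in seen[user]:
            # Summing the scores is an assumption for this sketch.
            candidates[user][candidate] += score

    return {
        user: sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        for user, scores in candidates.items()
    }


similar = [("M1", "A", 5), ("M1", "M2", 3), ("M2", "A", 2)]
user_clicks = [("u1", "M1"), ("u1", "M2"), ("u2", "M1")]
recs = recommend(similar, user_clicks)
```

In a real Hadoop job, step 1 would be a reduce-side join keyed on the item and step 2 a second job keyed on the user; the dictionaries here only stand in for those shuffles.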
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781838#action_12781838 ] Ankur commented on MAHOUT-103: --

For this co-occurrence based recommender I am planning to write a set of map-reduce jobs that compute recommendations for users as follows:
1. Take the user's item history.
2. For each item in his history, fetch the top-N similar items (similarity based on co-occurrence).
3. Add the co-occurrence scores if an item appears more than once (NOT a weighted average).

Consider, e.g., the user history { M1, M2, M3 } and the top-3 similar movies for each of these along with co-occurrence scores:

M1 - (A, 5), (B, 4), (C, 2)
M2 - (D, 6), (E, 3), (F, 2)
M3 - (G, 8), (C, 5), (B, 2)

So the final scores in decreasing order will look like:

(G, 8) (C, 7) (B, 6) (D, 6) (A, 5) (E, 3) (F, 2)

The idea I want to capture is that a candidate item gets a higher score if it's similar to more items in the user's click history. Do you see any issue with this approach? Any other better approach that you can think of?

As for the precision-recall test, I am still trying to see how to divide the data into 'train' and 'test' for a fair evaluation. How do we do it in the existing code?
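The scoring rule in this comment can be checked with a few lines of Python. This is a minimal sketch of the summing step only; the function name `score_candidates` and the dictionary layout are made up for the example, and ties are broken alphabetically, which the comment does not specify.

```python
from collections import defaultdict


def score_candidates(history, top_similar):
    """Sum co-occurrence scores of candidates over the user's history."""
    scores = defaultdict(int)
    for item in history:
        for candidate, cooccurrence in top_similar.get(item, []):
            scores[candidate] += cooccurrence
    # Highest score first; alphabetical tie-break is an assumption.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


top_similar = {
    "M1": [("A", 5), ("B", 4), ("C", 2)],
    "M2": [("D", 6), ("E", 3), ("F", 2)],
    "M3": [("G", 8), ("C", 5), ("B", 2)],
}
ranked = score_candidates(["M1", "M2", "M3"], top_similar)
```

With the data from the comment, `ranked` reproduces the ordering given there: C and B move up because each is similar to two items in the history.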
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781858#action_12781858 ] Sean Owen commented on MAHOUT-103: -- Yes, this is basically item-based recommendation. With some superficial changes, it would exactly fit that model. Co-occurrence here is like a similarity metric, which is ultimately used as a weighting. Canonically this value would be in [-1,1], and you can easily map [1,...) into that range of course. Next, you're sort of estimating preferences when you add up co-occurrence values. Canonically, you'd be doing a weighted average over M1 - M3. This is the same thing -- you're just not dividing by 3. The result is conceptually the same, though different approaches would yield slightly different results. I'm not necessarily suggesting you change the algorithm. At the same time, I am also about to implement this very same thing -- the more 'canonical' form, to go hand-in-hand with the existing GenericItemBasedRecommender. I'd rather avoid duplication, and would like to make the Hadoop-based implementation as analogous to the existing code as possible. All I'd say is, go ahead, and maybe we look at generalizing it or shifting these concepts towards the canonical setup later. Look at GenericIRStatsEvaluator and subclass for precision-recall approaches.
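Two points in this comment can be illustrated with a short sketch: dividing every summed score by the history size does not change the ranking, and raw counts in [1, ...) can be squashed into a bounded range. The mapping `1 - 1/count` below is just one possible choice picked for the example; the comment does not say which mapping was meant. The summed scores are the ones from the earlier M1-M3 example in this thread.

```python
def squash(count):
    """Map a co-occurrence count in [1, inf) into [0, 1).

    One possible monotone mapping, chosen for illustration only.
    """
    return 1.0 - 1.0 / count


def rank(scores):
    """Item names ordered by decreasing score, ties broken by name."""
    return [item for item, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


# Summed scores from the M1-M3 example; the history holds 3 items.
summed = {"G": 8, "C": 7, "B": 6, "D": 6, "A": 5, "E": 3, "F": 2}
history_size = 3
averaged = {item: score / history_size for item, score in summed.items()}

same_order = rank(summed) == rank(averaged)
```

Because the divisor is the same for every candidate of a given user, the order is identical; the values only differ in scale.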
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778838#action_12778838 ] Sean Owen commented on MAHOUT-103: -- It looks fine to me, with one request -- put it in a subpackage of the .hadoop package? I'm going to rearrange the bits already in there into subpackages as they no longer share one related purpose. From there, if I have any other comments, we can look at those after it's committed. I'm certain it's good enough to get into the repo now, if you believe it's ready. In about 2 weeks I will begin writing the recommenders-in-Hadoop chapter, so this is very timely. I've been worried since my Hadoop-related code has been stymied by a Hadoop bug that's still not fixed. I am hoping your approach has a way around it. I'd like to synthesize one approach to using Hadoop 0.20+ based on our implementations and then look to making the whole project consistent in this regard.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778911#action_12778911 ] Ankur commented on MAHOUT-103: -- Thanks for the quick look, appreciate that :-). Putting it in a subpackage, sure; for now I'll just leave all the main code under one subpackage (how about 'bigram') until you have it sorted out. As for the code, once I have the test code ready for the Netflix dataset and at least one unit test, it will be good to go. One question: how do we apply precision-recall or RMSE or any other evaluation technique to the results, since all we are doing is counting co-occurrence? Do you have the JIRA for this Hadoop-related bug?
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778950#action_12778950 ] Sean Owen commented on MAHOUT-103: --

I don't have a JIRA, just some threads on the mailing list. I'm going to dig back into this next week and, if I still see the problem, formally report the bug. I hope it's cleared up in the latest code.

Yeah, this code just implements counting co-occurrence, which isn't a complete recommender. Ideally each of these subpackages is a Hadoop-based recommender system that outputs recommendations. When the input has rating values, you will output estimated rating values too, I'd imagine. That's good, but not quite what's needed to conduct an RMSE evaluation of the estimates. For that we'd want to hold back, say, 5% of the data and see how well it estimated the ratings of that 5%. But the output is recommendations, which don't necessarily include that 5% of test data. (You could write a separate job that does it, sure.)

But it seems relatively easier to conduct a precision-recall test. Identify for each user some good recommendations, perhaps their favorite items. Remove those from the input. See how much of it gets recommended back in the output. From that you can compute precision and recall figures on the recommendations.

It's all a nice-to-have -- to start I'd like to have a couple of end-to-end, consistent recommenders based on Hadoop. I can see three right now:
- Your co-occurrence-based system
- The, er, dot-product-based item-based recommender Ted sketched, which I'll write. Not sure what to call it.
- The pseudo-distributed system already in there
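The precision-recall test outlined in this comment could look roughly like the sketch below. It is not Mahout's evaluator code; the helper names `hold_out_favorites` and `precision_recall`, the choice of "top n rated items" as the favorites, and the sample ratings are assumptions for illustration.

```python
def hold_out_favorites(ratings, n):
    """Split one user's ratings into training data and n held-out favorites.

    ratings: dict mapping item -> rating for a single user.
    """
    ranked = sorted(ratings.items(), key=lambda kv: (-kv[1], kv[0]))
    held_out = [item for item, _ in ranked[:n]]
    training = {item: r for item, r in ratings.items() if item not in held_out}
    return training, held_out


def precision_recall(recommended, held_out):
    """Precision and recall of a recommendation list against held-out items."""
    hits = len(set(recommended) & set(held_out))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall


training, held_out = hold_out_favorites({"A": 5, "B": 4, "C": 1.5, "D": 5}, n=2)
precision, recall = precision_recall(["A", "B", "C", "D"], ["B", "D", "E"])
```

The recommender is run on `training` only; the recommendations it produces for that user are then compared with `held_out`, and the per-user figures are averaged over all users.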
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776921#action_12776921 ] Sean Owen commented on MAHOUT-103: -- Re-post an updated patch and I'm happy to give my comments on it. The more the merrier. If it's basically sound I'd like to mention it in the forthcoming book which I'm writing now. I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop the rating. The framework can do this automatically too if you like in the DataModel.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776939#action_12776939 ] Ankur commented on MAHOUT-103: --

> Re-post an updated patch
Sure, I'll have the updated code coming by early next week.

> If it's basically sound I'd like to mention it
+10. The more people know about it, the better chance it has of being used :-)

> I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop the rating ...
Simply dropping the rating might introduce too much noise. I was thinking of keeping only those that have ratings > 2.5 (or > 2, to be more liberal).
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776951#action_12776951 ] Sean Owen commented on MAHOUT-103: -- That last point is interesting. Another school of thought is that rating something, even negatively, suggests you have a closer association to that thing than to the millions of other things you've never heard of. Let's say you rate Bach a 5 and Brahms a 4 and Mendelssohn a 1.5. Would you rather recommend a Mendelssohn recording to this person, or death metal? This is my understanding of the intuition I've gotten from Ted, and seems to bear out somewhat in practice, that ratings have a lot less info than one would think. Well it's obviously something one can evaluate within the framework with the evaluator code to decide for sure.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776966#action_12776966 ] Ankur commented on MAHOUT-103: --

In that case dropping ratings might not be such a good idea and may lead to bad results. Consider the following movies that a user might have seen, with the scores:

Matrix - 4.5
Matrix Reloaded - 2.5
Matrix Revolutions - 2

Assuming that a lot of people have watched these movies and didn't like the subsequent two, they will still get high similarity scores w.r.t. Matrix going purely by co-occurrence. IMHO, that leaves us with the following 2 alternatives:
1. Add the ratings when counting co-occurrence and hope that better ones will stand out even if they co-occur less.
2. Apply a re-scorer that re-ranks the similar items for a given item based on their average scores.

Point 1 is something I am thinking of trying out.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776986#action_12776986 ] Sean Owen commented on MAHOUT-103: -- What's the problem in this example? Two people that have both seen all three Matrix films are probably similar. All the more so if they've rated the first one highly and the other two poorly. You'd correctly identify them as similar with or without ratings here. The issue, I suppose, comes up when you encounter someone who didn't like the first one and liked the other two (strange, I know). Without pref values, we'd draw the same conclusion -- they have some similarity. With pref values, most metrics would say they are very dissimilar. I actually think that's the wrong conclusion! The fact that two people bothered to watch all three says much more about their similarities than the variance in ratings says about their differences. I'd still guess they're sorta-similar, and metrics without pref values would tend to draw the more correct conclusion. Of course there's no one right answer, and we can easily construct situations where throwing out pref values indeed hurts the result. I'm only asserting that it's entirely possible, in real data sets, for ratings to *hurt* on the whole. Let's start by adding the basic approach and then keep going to look at variations. I at least have some global knowledge of how the framework is set up and could help design in these variations in a way that's consistent with the framework.
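The contrast drawn in this comment can be made concrete with two users who both watched all three Matrix films but rated them in opposite ways. The sketch below uses the Tanimoto coefficient as an example of a metric without pref values and Pearson correlation as an example of one with pref values; both choices, and the second user's ratings, are assumptions for illustration (the first user's ratings are the ones from Ankur's example).

```python
import math


def tanimoto(items_a, items_b):
    """Similarity from item overlap only (no preference values)."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


fan = {"Matrix": 4.5, "Matrix Reloaded": 2.5, "Matrix Revolutions": 2.0}
# Hypothetical second user who preferred the sequels.
contrarian = {"Matrix": 2.0, "Matrix Reloaded": 4.0, "Matrix Revolutions": 4.5}

overlap_similarity = tanimoto(fan, contrarian)
rating_similarity = pearson(fan, contrarian)
```

The overlap-based value says the two users are as similar as possible, while the rating-based value says they are strongly dissimilar, which is exactly the disagreement the comment describes.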
Re: [jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
The classic technique is to get total counts for B and for all events and assume that these are effectively *B and *B' (that is, the number of times that B exists is the number of times that either A or A' occurred with B). This is usually just fine. Then k(A'B) = k(B) - k(AB) and k(A'B') = k(*) - k(B) - k(AB').

It is commonly necessary to compute the total and the individual item counts like k(B) in a separate pass through the data, but some storage schemes may make that unnecessary.

To get this right requires a careful touch. I try to keep in mind that if we start with a binary occurrence matrix U = users x items, then the item cooccurrence matrix should be exactly U' U = items x items. If U is represented as a set of (user, item) pairs, then we can compute U' U by a join to self followed by counting. This usually results in a list of (item, item, count) triples.

The item counts are then the rowsums of U' U and the overall sum of U' U is the total count. These can easily be computed from U' U and take much less time than computing U' U. Note that U' U is symmetrical so you should only need to compute the rowsums. Checking that the column sums are the same is a nice exercise, but not necessary in production. The rowsums are represented as (item, count) pairs and the overall total is just a count.

The final scoring of the cooccurrence counts requires that you join the triples that form U' U once with the rowsums against the first member of the triple and then again with the rowsums against the second member of the triple. You can carry around the overall total separately, but there is an implied join against this single number as well. This gives you tuples of the form (a, b, count_ab, count_a, count_b, total). The counts you need for the contingency table are:

count_ab, count_a - count_ab, count_b - count_ab, total - count_a - count_b + count_ab

Does this answer your question?
I can make the log-likelihood code have a method that takes count_ab, count_a, count_b and total as arguments.

On Sun, Mar 22, 2009 at 10:32 PM, Ankur Goel ankur.g...@corp.aol.com wrote: As you can see from each entry we can get the values for AB and AB'. A'B and A'B' are not available directly in a single result and need to be computed. It would be nice to hear how you are planning to implement this in map-red.

-- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 www.deepdyve.com 408-773-0110 ext. 738 858-414-0013 (m) 408-773-0220 (fax)
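As a rough idea of what a method taking count_ab, count_a, count_b and total could look like, here is a minimal sketch. It uses the entropy form of the log-likelihood ratio (G^2) statistic for a 2x2 table; the class name `LlrSketch` and its method names are hypothetical and are not the code offered in this thread.

```java
/** Illustrative only: log-likelihood ratio for a 2x2 contingency table. */
final class LlrSketch {

  /** x * ln(x), with the usual convention that 0 * ln(0) = 0. */
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  /** Unnormalized Shannon entropy: N ln N - sum(k ln k), where N is the sum of the counts. */
  private static double entropy(long... counts) {
    long sum = 0;
    double parts = 0.0;
    for (long k : counts) {
      sum += k;
      parts += xLogX(k);
    }
    return xLogX(sum) - parts;
  }

  /** LLR from the four cells k11 = k(AB), k12 = k(AB'), k21 = k(A'B), k22 = k(A'B'). */
  static double fromTable(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Clamp tiny negative values that can appear through floating-point rounding.
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
  }

  /** LLR from the joined tuple fields (count_ab, count_a, count_b, total). */
  static double fromCounts(long countAb, long countA, long countB, long total) {
    return fromTable(
        countAb, countA - countAb, countB - countAb, total - countA - countB + countAb);
  }

  public static void main(String[] args) {
    System.out.println(fromCounts(2, 5, 5, 14));
  }
}
```

A table whose rows and columns are independent scores zero, and the score grows as the observed cooccurrence departs from what the marginal counts alone would predict.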
Re: [jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
Ankur, What form will the counts be in when you need this function? Four integers separately available? Values in a view of a matrix? I will be happy to adapt some code to compute the measure you need. On Wed, Mar 18, 2009 at 9:45 PM, Ankur (JIRA) j...@apache.org wrote: map-red implementation of the log-likelihood ratio test as described in Ted's paper. -- Ted Dunning, CTO DeepDyve
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683516#action_12683516 ] Ted Dunning commented on MAHOUT-103: Hmmm. I actually think of a click as the relation that connects a user to an item. As such, it is distinct from either. And I routinely do recommendation-like computations that involve users, network entities, query terms, documents, and other things that you would call users as they relate (by abstract clicks) to users, query terms, videos, music, web pages, network entities, words, query terms and other things that you would call items. There is a horrible tension here between naming things by their most common usage and expecting programmers to realize that they really are abstract entities, or naming things in a totally abstract way and risking that no programmers ever catch on. An example of an abstract naming that derives from linguistic terminology might be Agent (instead of User), Relation (instead of Click) and Target (instead of Item). This makes the general interaction be Relation \subseteq Agent \times Target. I wouldn't recommend this, however, because (as you say) people generally describe social algorithms in excessively concrete ways.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683170#action_12683170 ] Sean Owen commented on MAHOUT-103: -- 1. How do you feel about, therefore, changing to use more abstract objects rather than, say, Click? These objects could be the existing ones, or modified or new ones. I think, as you say, the existing objects are about what is needed. That way the solution is that much more reusable. Same with the job -- the more it uses abstract/standard classes, the more reusable I think it looks. 2. Yeah, the two interfaces are nearly identical: provide a method that takes two items as input and a numerical score as output. I suppose it just makes sense to use the existing ItemSimilarity interface in this section of the code. 3. Good question, here is my brief digression: The code was originally written with an on-line model in mind -- recommendations happen in real-time. Over time that has proved inefficient or impractical for large data sets, though it remains quite nice for small- to medium-size data sets. Hence I have attempted to preserve the real-time model at the core, and build a batch-oriented extension around it using Hadoop. The two are a bit separate, and that is fine. So in this section of the code, I don't mind attaching Hadoop-related jobs that are not intimately connected to the core code. I am trying to keep them as consistent as possible so that the original on-line and newer off-line models don't evolve into two separate worlds within this part of the code. To be specific... well I don't know, I don't have a problem with adding this job actually. Ideally we build a bit more around it: it takes as input the standard preference-file format as used by FileDataModel, and outputs a file format that can be read by a new ItemSimilarity implementation that would read and cache all these results. That would be a nice step towards integrating with the core code.
This is something I have been remiss in - I wrote a job to do the pre-computation of item-item diffs for slope one but never wrote an implementation of DiffStorage that would read this output and operate based on those results. This would close the loop. How about we make #3 my part of this issue, to complete the connection between this job and the core code a bit more?
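The idea in point 3 of a similarity implementation that reads and caches precomputed job output can be sketched as follows. Everything here is an assumption made for illustration: the `PairSimilarity` interface is a stand-in and not Mahout's ItemSimilarity, the class name `PrecomputedSimilarity` is hypothetical, and the `itemA,itemB,score` line format is invented because the thread does not specify an output format.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an item-item similarity contract; not Mahout's interface. */
interface PairSimilarity {
  double similarity(String itemA, String itemB);
}

/** Illustrative only: loads precomputed scores once and answers lookups from memory. */
final class PrecomputedSimilarity implements PairSimilarity {

  private final Map<String, Map<String, Double>> scores = new HashMap<>();

  /** Each non-blank line is assumed to have the form "itemA,itemB,score". */
  PrecomputedSimilarity(List<String> lines) {
    for (String raw : lines) {
      String line = raw.trim();
      if (line.isEmpty()) {
        continue;
      }
      String[] fields = line.split(",");
      if (fields.length != 3) {
        throw new IllegalArgumentException("Expected itemA,itemB,score but got: " + line);
      }
      double score = Double.parseDouble(fields[2].trim());
      // Store both directions so lookups do not depend on argument order.
      put(fields[0].trim(), fields[1].trim(), score);
      put(fields[1].trim(), fields[0].trim(), score);
    }
  }

  private void put(String a, String b, double score) {
    scores.computeIfAbsent(a, k -> new HashMap<>()).put(b, score);
  }

  /** Returns the cached score, or NaN when the pair was not in the precomputed output. */
  @Override
  public double similarity(String itemA, String itemB) {
    Map<String, Double> row = scores.get(itemA);
    Double score = row == null ? null : row.get(itemB);
    return score == null ? Double.NaN : score;
  }

  public static void main(String[] args) {
    PairSimilarity sim = new PrecomputedSimilarity(Arrays.asList("a,b,0.5", "b,c,0.25"));
    System.out.println(sim.similarity("b", "a"));
  }
}
```

For a real file, the lines could be obtained with `java.nio.file.Files.readAllLines` before constructing the object; holding everything in memory is only reasonable while the item-item output stays small.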
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683232#action_12683232 ] Ted Dunning commented on MAHOUT-103: 1. How do you feel about, therefore, changing to use more abstract objects rather than, say, Click? How is click more or less abstract than the term user?
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683238#action_12683238 ] Sean Owen commented on MAHOUT-103: -- The comparison would be to Item. You could say that's as domain-specific as Click; I'd suggest that User/Item are the 'abstract' concepts in this context since collaborative filtering is invariably explained in terms of users and items, though of course your user or item can be whatever you like. At least, there is no need to have both Click and Item -- unless this particular context requires one to store more information about a click as an item, in which case it should at least implement Item. But I don't think that's the case. The good news is that this work doesn't seem to only apply to processing click logs, so, I'm suggesting it might be even more useful to express it in terms of the 'abstract' concepts in this context.
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682915#action_12682915 ] Ankur commented on MAHOUT-103: -- Hey Sean, Thanks for the review comments. Some specific responses and questions: 1. This indeed is doing approximately the same thing as TanimotoCoefficientSimilarity and BooleanPreferenceUser. The difference is that the similarity computation is parallelized in map-reduce. 2. The idea of introducing a FitnessEvaluator was to allow people to apply domain-specific things when comparing a preference. Are you suggesting the replacement of FitnessEvaluator with ItemSimilarity? 3. The Hadoop job was written to run this thing stand-alone. What modifications do you feel would be appropriate for integration into the framework?
[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood
[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12668475#action_12668475 ] Ankur commented on MAHOUT-103: -- I am hoping to make the above improvements after I get some review comments.