[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860223#action_12860223 ]
Sean Owen commented on MAHOUT-305:
----------------------------------

And now more thoughts:

Yes, all the code is checked in. This is still running perhaps slower than I'd like. The move to distribute the computation further slowed things down considerably in the I/O phases, but it avoided use of MapFile, which was in the end just being used very wrongly. So a net win.

The slowest step by far is outputting the partial vector products. Each is as big as a column of the co-occurrence matrix (which is sparse, yes), and one is output for each preference value. That's huge. This would be an ideal place for a combiner, but the output comes from a reducer, so a combiner is not available (?)

Co-occurrence is also slowish. It does use a combiner, but to get a good hit rate it needs a very large buffer.

Everything works quite well if you're willing to prune data. For example, very roughly, on a 10M rating data set -- *but keeping only 20 prefs per user for each of 70,000 users* -- the total time per user is in seconds of machine time. Not too bad. But take that pruning off and this still balloons quite a bit. Naturally, pruning is a good thing, but it seems like we should be able to speed this up more.

> Combine both cooccurrence-based CF M/R jobs
> -------------------------------------------
>
>                 Key: MAHOUT-305
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-305
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.2
>            Reporter: Sean Owen
>            Assignee: Ankur
>            Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make
> recommendations based on item co-occurrence:
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be
> merged. Not sure exactly how to approach that but noting this in JIRA, per
> Ankur.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
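
For readers following the partial-vector-product step described in the comment, here is a minimal plain-Java sketch of the arithmetic involved. It is not the Mahout code: the class name PartialProductSketch, the methods accumulate and recommend, and the use of nested HashMaps as sparse columns are all assumptions made for illustration, and no Hadoop API is used. Each preference value scales one sparse co-occurrence column (one "partial product"), and summing those partial products per user is the work a combiner-like step would collapse before output.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not taken from org.apache.mahout.cf.taste.hadoop.
public class PartialProductSketch {

    // Adds prefValue * column into the running sum, entry by entry.
    // 'column' is one sparse column of the co-occurrence matrix (itemID -> count).
    static void accumulate(Map<Integer, Double> sum,
                           Map<Integer, Double> column,
                           double prefValue) {
        for (Map.Entry<Integer, Double> e : column.entrySet()) {
            sum.merge(e.getKey(), prefValue * e.getValue(), Double::sum);
        }
    }

    // Computes one user's score vector: the sum over preferred items of
    // (preference value * co-occurrence column for that item).
    static Map<Integer, Double> recommend(Map<Integer, Map<Integer, Double>> cooccurrence,
                                          Map<Integer, Double> userPrefs) {
        Map<Integer, Double> scores = new HashMap<>();
        for (Map.Entry<Integer, Double> pref : userPrefs.entrySet()) {
            Map<Integer, Double> column = cooccurrence.get(pref.getKey());
            if (column != null) {
                // One partial product per preference; summed here instead of emitted.
                accumulate(scores, column, pref.getValue());
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy co-occurrence matrix with two sparse columns (items 1 and 2).
        Map<Integer, Map<Integer, Double>> cooccurrence = new HashMap<>();
        Map<Integer, Double> col1 = new HashMap<>();
        col1.put(2, 3.0);
        col1.put(3, 1.0);
        Map<Integer, Double> col2 = new HashMap<>();
        col2.put(1, 3.0);
        col2.put(3, 2.0);
        cooccurrence.put(1, col1);
        cooccurrence.put(2, col2);

        // One user with two preferences.
        Map<Integer, Double> prefs = new HashMap<>();
        prefs.put(1, 5.0);
        prefs.put(2, 4.0);

        System.out.println(recommend(cooccurrence, prefs));
    }
}
```

The point of the sketch is the size argument: emitting every scaled column separately produces one column-sized record per preference, while summing locally produces a single vector per user.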