[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860895#action_12860895 ]
Sean Owen commented on MAHOUT-305:
----------------------------------

Most broadly, the input is item1->item2 pairs and the final output of the co-occurrence step is item1->((item2,count2),(item3,count3),...) -- rows of the co-occurrence matrix. I'm trying to do it in one pass, so I need the reducer keyed by item1 instead of by a pair. So I generate item1->(item2,count) in the mapper, combine, and then reduce to the vector in one go. But OK, you're suggesting unpacking that: first output (item1,item2)->count, and then a second stage can generate the rows from that.

In ItemSimilarityEstimator, the map seems to emit (item1,item2)->(item2,count). OK, so I understood the 'redundant' item2 in the key to be the secondary-sort trick, so you can efficiently sum up counts and emit the final (item1,item2)->count. But then why is the value (item2,count) and not just count?

And why is a priority queue needed? Is it just used to store and emit only the top item-item pairs by count? That makes sense, though you've made the size 20; isn't that too small? Or is it that the user must set this to a reasonable size? Or you could scrap the priority queue and filter based on a count cutoff. I'm fond of culling co-occurrences of 1 and keeping everything else. No queue is needed for that, though it doesn't cap the size of the resulting matrix.

What worries me is the size of the map output. Spilling an (item1,item2) pair for every co-occurrence is absolutely massive. With P preferences, U users, and I items, each user contributes about (P/U)^2 pairs, so you're spilling about U*(P/U)^2 = P^2/U pairs. With a billion preferences that's getting easily into the quadrillions.

Now of course the combiner, in theory, prevents almost all of this from spilling to disk. It sums up counts, so the number of pairs output is more on the order of I^2. In practice, I think the combiner doesn't have enough memory to work with. Before the map output buffer has to spill through the combiner, it rarely tallies the same item-item co-occurrence more than once. On a smallish data set I find the 'hit rate' to be about 10% in this regard, even with io.sort.mb increased from a healthy 200MB to 1000MB. And this, I think, is what's killing the job: emitting so many low-count pairs.

So that's why I was trying to be very aggressive about throwing out data in the combiner, and maybe I need to do more. Being super-aggressive can mean capping the size of that intermediate map quite a bit further. That also somewhat addresses the scalability bottleneck issue, and enables this to happen in one go anyway.

Or perhaps I'm missing why emitting pairs with counts is actually going to be very scalable. It's very attractive after the map phase, but I think it's the spill after the map that's the problem.

TupleWritable is copied and pasted from Hadoop, right?

> Combine both cooccurrence-based CF M/R jobs
> -------------------------------------------
>
>              Key: MAHOUT-305
>              URL: https://issues.apache.org/jira/browse/MAHOUT-305
>          Project: Mahout
>       Issue Type: Improvement
>       Components: Collaborative Filtering
> Affects Versions: 0.2
>         Reporter: Sean Owen
>         Assignee: Ankur
>         Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make
> recommendations based on item co-occurrence:
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be
> merged. Not sure exactly how to approach that but noting this in JIRA, per
> Ankur.
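
For reference, below is a minimal sketch (illustrative only, not Mahout code; the class name and the Text "item1,item2" key encoding are assumptions made for brevity) of the kind of aggressive combiner discussed in the comment above: it sums the partial counts buffered for an item-item pair and drops the pair if its tally is still 1, accepting that this is lossy since the combiner only sees one map task's buffer at a time.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative combiner: the key is an "item1,item2" Text pair (an assumption for
// this sketch; a real job would use a proper pair WritableComparable), the value is
// a partial co-occurrence count emitted by the mapper.
public class AggressiveCooccurrenceCombiner
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text itemPair, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    // Sum the partial counts this map task has buffered for the pair.
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    // Aggressive culling: only emit pairs that co-occurred more than once here.
    // This can drop pairs whose true count across all map tasks exceeds 1, which is
    // exactly the lossy trade-off being weighed in the comment above.
    if (sum > 1) {
      total.set(sum);
      context.write(itemPair, total);
    }
  }
}

The same culling would apply equally to the item1-keyed, (item2,count)-valued layout described in the comment; only the key and value types change.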