[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860895#action_12860895 ]

Sean Owen commented on MAHOUT-305:
----------------------------------

Most broadly, the input is item1->item2 pairs, and the final output of the 
co-occurrence step is item1->((item2,count2),(item3,count3),...) -- rows of the 
co-occurrence matrix. I'm trying to do it in one pass, so I need the reducer 
keyed by item1 instead of by a pair. So I generate item1->(item2,count) in the 
mapper, combine, and then reduce to the vector in one go.
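To make that shape concrete, here's roughly what I mean -- a sketch only, with 
illustrative class names, not the actual patch:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public final class CooccurrenceSketch {

  /** Illustrative (item2, count) value. */
  public static class ItemCount implements Writable {
    long itemID;
    int count;
    public void write(DataOutput out) throws IOException {
      out.writeLong(itemID);
      out.writeInt(count);
    }
    public void readFields(DataInput in) throws IOException {
      itemID = in.readLong();
      count = in.readInt();
    }
  }

  /** For each co-occurring input pair, emit item1 -> (item2, 1). */
  public static class PairMapper
      extends Mapper<LongWritable,LongWritable,LongWritable,ItemCount> {
    private final ItemCount one = new ItemCount();
    protected void map(LongWritable item1, LongWritable item2, Context ctx)
        throws IOException, InterruptedException {
      one.itemID = item2.get();
      one.count = 1;
      ctx.write(item1, one);
    }
  }

  /** Combiner: sum partial counts per item2 before the spill hits disk. */
  public static class SumCombiner
      extends Reducer<LongWritable,ItemCount,LongWritable,ItemCount> {
    protected void reduce(LongWritable item1, Iterable<ItemCount> values,
        Context ctx) throws IOException, InterruptedException {
      // Hadoop reuses the value instance, so copy fields out of it
      Map<Long,Integer> counts = new HashMap<Long,Integer>();
      for (ItemCount v : values) {
        Integer c = counts.get(v.itemID);
        counts.put(v.itemID, c == null ? v.count : c + v.count);
      }
      ItemCount out = new ItemCount();
      for (Map.Entry<Long,Integer> e : counts.entrySet()) {
        out.itemID = e.getKey();
        out.count = e.getValue();
        ctx.write(item1, out);
      }
    }
  }

  /** Reducer: assemble one row of the co-occurrence matrix. */
  public static class RowReducer
      extends Reducer<LongWritable,ItemCount,LongWritable,Text> {
    protected void reduce(LongWritable item1, Iterable<ItemCount> values,
        Context ctx) throws IOException, InterruptedException {
      Map<Long,Integer> row = new HashMap<Long,Integer>();
      for (ItemCount v : values) {
        Integer c = row.get(v.itemID);
        row.put(v.itemID, c == null ? v.count : c + v.count);
      }
      // stand-in for emitting a real vector writable
      ctx.write(item1, new Text(row.toString()));
    }
  }
}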


But OK, you're suggesting to unpack that and first output 
(item1,item2)->count, and from there a second stage can generate the rows.

In ItemSimilarityEstimator, the map seems to emit (item1,item2)->(item2,count). 
OK, so I understood the 'redundant' item2 in the key to be the secondary sort 
trick, so you can efficiently sum up counts and emit the final 
(item1,item2)->count. But then why is the value (item2,count) and not just 
count?
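
For reference, here's the usual secondary-sort plumbing I'd expect -- a 
generic sketch, not a claim about what ItemSimilarityEstimator actually does. 
item2 rides along in the composite key for sorting, while partitioning and 
grouping look at item1 only, so one reduce call sees all of item1's pairs in 
item2 order and can sum runs of equal item2 with O(1) state. With that layout 
the value really can be a bare count:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

/** Composite (item1, item2) key, sorted by item1 then item2. */
public class ItemPair implements WritableComparable<ItemPair> {
  long item1;
  long item2;
  public void write(DataOutput out) throws IOException {
    out.writeLong(item1);
    out.writeLong(item2);
  }
  public void readFields(DataInput in) throws IOException {
    item1 = in.readLong();
    item2 = in.readLong();
  }
  public int compareTo(ItemPair other) {
    if (item1 != other.item1) {
      return item1 < other.item1 ? -1 : 1;
    }
    return item2 < other.item2 ? -1 : (item2 > other.item2 ? 1 : 0);
  }
}

/** Route every key with the same item1 to the same reduce task. */
class Item1Partitioner extends Partitioner<ItemPair,IntWritable> {
  public int getPartition(ItemPair key, IntWritable value, int numPartitions) {
    return (int) ((key.item1 & Long.MAX_VALUE) % numPartitions);
  }
}

/** Group reduce input by item1 only, ignoring item2. */
class Item1GroupingComparator extends WritableComparator {
  protected Item1GroupingComparator() {
    super(ItemPair.class, true);
  }
  public int compare(WritableComparable a, WritableComparable b) {
    long x = ((ItemPair) a).item1;
    long y = ((ItemPair) b).item1;
    return x < y ? -1 : (x > y ? 1 : 0);
  }
}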

And why is a priority queue needed? Is it just used to store and emit only the 
top item-item pairs by count? That makes sense, though you've made the size 
20 -- isn't that too small? Or is it that the user must set this to a 
reasonable size? Or you could scrap the priority queue and filter on a count 
cutoff instead. I'm fond of culling co-occurrences of 1 and keeping everything 
else. No queue is needed for that, though it doesn't cap the size of the 
resulting matrix.
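
For what it's worth, the capping I'm picturing is the standard bounded 
min-heap, where N is whatever the user configures (sketch; names are 
illustrative):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Keeps only the N highest-count (item, count) pairs seen so far. */
public final class TopNCounts {

  private final int n;
  // Min-heap ordered by count, so the smallest survivor sits on top
  private final PriorityQueue<long[]> queue; // entries are [itemID, count]

  public TopNCounts(int n) {
    this.n = n;
    this.queue = new PriorityQueue<long[]>(n + 1, new Comparator<long[]>() {
      public int compare(long[] a, long[] b) {
        return a[1] < b[1] ? -1 : (a[1] > b[1] ? 1 : 0);
      }
    });
  }

  public void offer(long itemID, long count) {
    queue.add(new long[] {itemID, count});
    if (queue.size() > n) {
      queue.poll(); // evict the current lowest count
    }
  }

  public List<long[]> retrieve() {
    return new ArrayList<long[]>(queue);
  }
}

The count-cutoff variant just replaces the heap with a test against the 
cutoff at emit time.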

What worries me is the size of the map output. Spilling an (item1,item2) pair 
for every co-occurrence is absolutely massive. With P preferences, U users, 
and I items, each user has about P/U preferences and so generates about 
(P/U)^2 pairs, for a total of roughly U*(P/U)^2 = P^2/U pairs spilled. With a 
billion preferences that's getting easily into quadrillions.

Now of course the combiner, in theory, prevents almost all of this from 
spilling to disk. It sums up counts, so the number of pairs output is more on 
the order of I^2. In practice, I think the combiner doesn't have enough 
memory: before the map output buffer has to be spilled through the combiner, 
it rarely contains the same item-item co-occurrence more than once. On a 
smallish data set I find the 'hit rate' to be about 10% in this regard, even 
with io.sort.mb increased from a healthy 200MB to 1000MB.
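
For the record, these are the knobs in question, as I understand the 0.20 
spill path (setup sketch; SumCombiner refers to the combiner sketched 
earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public final class JobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Map-side sort buffer, in MB. Each spill of this buffer runs through
    // the combiner, so a bigger buffer means more duplicate pairs can be
    // summed before anything reaches disk.
    conf.setInt("io.sort.mb", 1000);
    // With huge numbers of tiny (pair, count) records, the per-record
    // accounting area can fill before the data does; io.sort.record.percent
    // (default 0.05) is its share of the buffer.
    conf.setFloat("io.sort.record.percent", 0.15f);
    Job job = new Job(conf, "cooccurrence");
    job.setCombinerClass(CooccurrenceSketch.SumCombiner.class);
    // ... input/output formats and paths as usual
  }
}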

And this is what's killing the job, I think: emitting so many low-count pairs. 
So that's why I was trying to be very aggressive in the combiner about 
throwing out data, and maybe I need to do more. Being super-aggressive can 
mean capping the size of that intermediate map quite a bit more. And then that 
also largely addresses the scalability bottleneck, and enables this to happen 
in one pass anyway. Something like the sketch below is what I mean.
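
That is, roughly this (a sketch; the cutoff is an illustrative knob, ItemPair 
is the composite key sketched above, and it's shown on the 
(item1,item2)->count layout for simplicity). Note a combiner that drops data 
is deliberately lossy -- Hadoop doesn't promise the combiner sees all of a 
key's values, or that it runs at all, so this trims the flood of low-count 
pairs without guaranteeing an exact floor:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

/** Lossy combiner: sum partial counts and drop sums below a cutoff. */
public class CullingCombiner
    extends Reducer<ItemPair,IntWritable,ItemPair,IntWritable> {

  private static final int CUTOFF = 2; // i.e. cull co-occurrence counts of 1

  protected void reduce(ItemPair pair, Iterable<IntWritable> values,
      Context ctx) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    if (sum >= CUTOFF) {
      ctx.write(pair, new IntWritable(sum));
    }
  }
}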

Or perhaps I'm missing why emitting pairs with counts is actually going to be 
very scalable. It's very attractive after the map phase, but it's the spill 
after the map that's the problem, I think.


TupleWritable is copied and pasted from Hadoop, right?

> Combine both cooccurrence-based CF M/R jobs
> -------------------------------------------
>
>                 Key: MAHOUT-305
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-305
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.2
>            Reporter: Sean Owen
>            Assignee: Ankur
>            Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make 
> recommendations based on item co-occurrence: 
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
> merged. Not sure exactly how to approach that but noting this in JIRA, per 
> Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
