[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860223#action_12860223
 ] 

Sean Owen commented on MAHOUT-305:
----------------------------------

And now more thoughts:

Yes, all the code is checked in.

This is still running perhaps slower than I'd like. Distributing the 
computation further slowed things down considerably in the I/O phases, but it 
avoided the use of MapFile, which in the end was just being used very wrongly. 
So a net win.

The slowest step by far is outputting the partial vector products. Each is as 
big as a column of the co-occurrence matrix (which is sparse, yes), and one is 
output for each preference value. That's huge. This would be an ideal place for 
a combiner, but the output comes from a reducer, so a combiner is not 
available (?)
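
To make the combiner point concrete, here is a minimal plain-Java sketch of 
why these partial products are a natural fit for one: each partial product is 
a sparse column scaled by a preference value, and the final step is just an 
element-wise sum, which is associative. This is not the Mahout code; the class 
and method names are made up for illustration, and sparse vectors are modeled 
as simple maps from item ID to value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Mahout's implementation.
final class PartialProductSketch {

    // Scale one sparse co-occurrence column by a single preference value.
    static Map<Integer, Double> partialProduct(Map<Integer, Double> column, double pref) {
        Map<Integer, Double> out = new HashMap<>();
        for (Map.Entry<Integer, Double> e : column.entrySet()) {
            out.put(e.getKey(), e.getValue() * pref);
        }
        return out;
    }

    // Element-wise sum of sparse vectors. Because addition is associative,
    // this could be applied to any subset of partials before the final sum.
    static Map<Integer, Double> sum(Iterable<Map<Integer, Double>> partials) {
        Map<Integer, Double> total = new HashMap<>();
        for (Map<Integer, Double> p : partials) {
            for (Map.Entry<Integer, Double> e : p.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Double> column = new HashMap<>();
        column.put(2, 3.0);
        column.put(3, 1.0);
        List<Map<Integer, Double>> partials = new ArrayList<>();
        partials.add(partialProduct(column, 2.0));
        partials.add(partialProduct(column, 0.5));
        System.out.println(sum(partials));
    }
}
```

Every partial product written out is as large as the column it came from, so 
summing them before they are written is where the savings would be.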

Co-occurrence is also slowish. It does use a combiner but to get a good hit 
rate, it needs to have a very large buffer.
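
As a rough illustration of the buffer-size effect, the toy model below counts 
item pairs in a bounded in-memory buffer and flushes it whenever a new pair 
does not fit. It is only a sketch under simplifying assumptions: it does not 
reproduce Hadoop's actual spill and combine mechanism, and the class name, the 
flush policy, and the packed pair key are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a bounded combining buffer.
final class CooccurrenceBufferSketch {
    private final int capacity; // max distinct pairs held before a flush
    private final Map<Long, Integer> counts = new HashMap<>();
    // Stand-in for records written downstream: {pairKey, count}.
    private final List<long[]> emitted = new ArrayList<>();

    CooccurrenceBufferSketch(int capacity) {
        this.capacity = capacity;
    }

    // Pack two non-negative item IDs into one order-independent key.
    static long pairKey(int a, int b) {
        int lo = Math.min(a, b);
        int hi = Math.max(a, b);
        return ((long) lo << 32) | hi;
    }

    // Count one co-occurrence; flush first if a new pair would not fit.
    void add(int itemA, int itemB) {
        long key = pairKey(itemA, itemB);
        if (!counts.containsKey(key) && counts.size() >= capacity) {
            flush();
        }
        counts.merge(key, 1, Integer::sum);
    }

    // Emit one record per buffered pair and empty the buffer.
    void flush() {
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            emitted.add(new long[] {e.getKey(), e.getValue()});
        }
        counts.clear();
    }

    int emittedRecords() {
        return emitted.size();
    }

    public static void main(String[] args) {
        int[][] pairs = {{1, 2}, {3, 4}, {1, 2}, {3, 4}, {1, 2}};
        for (int capacity : new int[] {1, 2}) {
            CooccurrenceBufferSketch buffer = new CooccurrenceBufferSketch(capacity);
            for (int[] p : pairs) {
                buffer.add(p[0], p[1]);
            }
            buffer.flush();
            System.out.println("capacity " + capacity + ": "
                + buffer.emittedRecords() + " records");
        }
    }
}
```

In this model a buffer too small to hold the distinct pairs emits nearly one 
record per input pair, while a buffer that holds them all emits one record per 
distinct pair.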

Everything works quite well if you're willing to prune data. For example, very 
roughly, on a 10M rating data set -- *but keeping only 20 prefs per user for 
each of 70,000 users* -- the total time per user is measured in seconds of machine 
time. Not too bad.

But take that pruning away and the running time still balloons quite a bit. 
Naturally, pruning is a good thing, but it seems like we should be able to 
speed this up more.
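
For reference, pruning of the kind described above could look roughly like the 
sketch below, which caps each user at a fixed number of preferences. Which 
preferences to keep is an assumption here: this version keeps the 
highest-valued ones, which may not match what the checked-in job does. The 
class, the Pref type, and the method name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of per-user preference pruning.
final class PrefPruningSketch {

    static final class Pref {
        final long itemId;
        final float value;

        Pref(long itemId, float value) {
            this.itemId = itemId;
            this.value = value;
        }
    }

    // Keep at most `max` preferences for one user, highest value first.
    // Assumption: "highest value" is the pruning criterion.
    static List<Pref> keepTop(List<Pref> prefs, int max) {
        List<Pref> sorted = new ArrayList<>(prefs);
        sorted.sort(Comparator.comparingDouble((Pref p) -> p.value).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(max, sorted.size())));
    }

    public static void main(String[] args) {
        List<Pref> prefs = new ArrayList<>();
        prefs.add(new Pref(1L, 3.0f));
        prefs.add(new Pref(2L, 5.0f));
        prefs.add(new Pref(3L, 1.0f));
        prefs.add(new Pref(4L, 4.0f));
        for (Pref p : keepTop(prefs, 2)) {
            System.out.println(p.itemId + " -> " + p.value);
        }
    }
}
```

With a cap of 20, the number of partial products per user is bounded by 20 no 
matter how many ratings that user has in the raw data.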

> Combine both cooccurrence-based CF M/R jobs
> -------------------------------------------
>
>                 Key: MAHOUT-305
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-305
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.2
>            Reporter: Sean Owen
>            Assignee: Ankur
>            Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make 
> recommendations based on item co-occurrence: 
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
> merged. Not sure exactly how to approach that but noting this in JIRA, per 
> Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
