[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836543#action_12836543
 ] 

Ankur commented on MAHOUT-305:
------------------------------

Sean, thanks for filing the JIRA. Noting the points from our discussion here.

1. We need to decide on the dataset to run both implementations on. I have the 
Netflix dataset in mind, but a strange thing I observed during my tests with it 
is that there were 2-3 users who had rated more than 10,000 movies! This seemed 
a little odd to me. Can you or someone else who has experience with the dataset 
validate my observation?

2. Both implementations need to run on the same dataset in an identical 
environment to gauge performance and accuracy. For accuracy I believe we need 
to do a precision-recall test. My understanding of it is:

      a) Do an 80-20 split of the data (80% train and 20% test), with the 
split made along the timeline.
      b) Feed the training data to the algorithm and generate recommendations 
for a subset of users from the training data.
      c) Compare those recommendations with the items actually present in each 
user's history in the test data.
      d) Calculate precision = tp / (tp + fp) = (recommendations actually 
present in user's history) / (total items recommended)
      e) Calculate recall = tp / (tp + fn) = (recommendations actually 
present in user's history) / (total items in user's history)
      f) Finally, take a simple average of each across all the users to get an 
approximate global precision and recall.

Please feel free to correct any of the steps above if I have misunderstood 
anything.
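
To make steps a) to f) concrete, here is a rough sketch in Python. It is only 
an illustration of the procedure as I described it, not either of the existing 
M/R jobs. The function names, the (user, item, timestamp) tuple layout, the 
top_n cutoff and the recommend() callback are all assumptions made up for this 
example, and "user's history" is taken to mean the user's items in the test 
split.

```python
from collections import defaultdict


def time_split(ratings, train_fraction=0.8):
    """Step a: order (user, item, timestamp) tuples by time, oldest first,
    and cut so the earliest train_fraction becomes the training set."""
    ordered = sorted(ratings, key=lambda r: r[2])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def items_by_user(ratings):
    """Group the rated items into one set per user."""
    history = defaultdict(set)
    for user, item, _timestamp in ratings:
        history[user].add(item)
    return history


def precision_recall(recommended, relevant):
    """Steps d and e for a single user.
    tp = recommended items that also appear in the user's test history."""
    recommended = set(recommended)
    tp = len(recommended & set(relevant))
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(recommend, ratings, top_n=10):
    """Steps a to f. `recommend` is a hypothetical callback with the shape
    recommend(train_history, user, top_n) -> iterable of item ids."""
    train, test = time_split(ratings)
    train_history = items_by_user(train)
    test_history = items_by_user(test)

    # Only users present in both splits can be scored (steps b and c).
    users = [u for u in train_history if u in test_history]
    if not users:
        return 0.0, 0.0

    total_precision = 0.0
    total_recall = 0.0
    for user in users:
        recs = recommend(train_history, user, top_n)
        p, r = precision_recall(recs, test_history[user])
        total_precision += p
        total_recall += r

    # Step f: simple (unweighted) average across the evaluated users.
    return total_precision / len(users), total_recall / len(users)
```

Both implementations could be wrapped behind the same recommend() callback so 
that they are scored on an identical split. Note that a simple average weights 
every user equally, so a handful of very heavy raters would not dominate the 
global numbers.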

> Combine both cooccurrence-based CF M/R jobs
> -------------------------------------------
>
>                 Key: MAHOUT-305
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-305
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.2
>            Reporter: Sean Owen
>            Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make 
> recommendations based on item co-occurrence: 
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
> merged. Not sure exactly how to approach that but noting this in JIRA, per 
> Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
