Another remark: using a small number of similarItemsPerItem (say 30-50) should give better results than using all of them.
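
To illustrate what capping the similar items per item means, here is a minimal sketch in plain Java. It is not Mahout code and not how ItemSimilarityJob is implemented; the class TopKSimilarItems, its methods and the usersByItem map are hypothetical names made up for this example. It assumes implicit (boolean) data, so each item is just the set of users who interacted with it, and it uses the Tanimoto coefficient as in the quoted setup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only (hypothetical class, not part of Mahout). */
final class TopKSimilarItems {

  /** Tanimoto coefficient of two user sets: |A intersect B| / |A union B|. */
  static double tanimoto(Set<Long> a, Set<Long> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 0.0;
    }
    // Iterate over the smaller set and probe the larger one.
    Set<Long> smaller = a.size() <= b.size() ? a : b;
    Set<Long> larger = smaller == a ? b : a;
    int intersection = 0;
    for (Long user : smaller) {
      if (larger.contains(user)) {
        intersection++;
      }
    }
    int union = a.size() + b.size() - intersection;
    return (double) intersection / union;
  }

  /** For every item, keep the IDs of its k most similar other items, best first. */
  static Map<Long, List<Long>> topK(Map<Long, Set<Long>> usersByItem, int k) {
    Map<Long, List<Long>> result = new HashMap<>();
    for (Long item : usersByItem.keySet()) {
      // Score every other item that shares at least one user with this one.
      Map<Long, Double> scores = new HashMap<>();
      for (Long other : usersByItem.keySet()) {
        if (other.equals(item)) {
          continue;
        }
        double sim = tanimoto(usersByItem.get(item), usersByItem.get(other));
        if (sim > 0.0) {
          scores.put(other, sim);
        }
      }
      // Sort by similarity descending; break ties by item ID for determinism.
      List<Long> ranked = new ArrayList<>(scores.keySet());
      ranked.sort((x, y) -> {
        int bySim = Double.compare(scores.get(y), scores.get(x));
        return bySim != 0 ? bySim : Long.compare(x, y);
      });
      // Prune to at most k neighbours.
      result.put(item, new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size()))));
    }
    return result;
  }
}
```

With this sketch, topK(usersByItem, 50) would keep at most 50 neighbours per item, while a very large k keeps every item with a non-zero similarity. The all-pairs loop is quadratic in the number of items, so it is only meant for small experiments.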

--sebastian

On 24.11.2011 02:14, Greg H wrote:
> Hello,
>
> I've been using Mahout's item-based recommender on several different
> implicit datasets. First I computed the item-item similarities by just
> passing in an ItemSimilarity and DataModel to the GenericItemSimilarity
> class but lately I've been using ItemSimilarityJob to calculate them on a
> Hadoop cluster. However, I've found that there is a significant difference
> in the results of my experiments depending on which method I use to
> calculate the similarities.
>
> For example, when I use the public MovieLens 1M dataset (which I've
> converted into an implicit dataset), calculating the similarities with:
>
> ItemBasedRecommender recommender = new
> GenericItemBasedRecommender(dataModel, new GenericItemSimilarity(new
> TanimotoCoefficientSimilarity(dataModel), dataModel));
>
> gives 0.25 precision when splitting the data for each user at the ratio of
> 80% - 20% and then looking at only the top 5 recommended items. However,
> when I compute the similarities with ItemSimilarityJob using the following
> command:
>
> hadoop jar mahout-core-0.6-SNAPSHOT-job.jar
> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> -Dmapred.input.dir=ml1m -Dmapred.output.dir=ml1m-output
> --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT
> --maxSimilaritiesPerItem 10000 --maxPrefsPerUser 10000 --booleanData true
>
> I only get 0.23 precision. The results differ even more significantly in
> some other datasets that I've been working with. I know that
> ItemSimilarityJob prunes some items so I've tried many different settings
> for maxSimilaritiesPerItem and maxPrefsPerUser and although this improves
> the results it still never matches the non-distributed version. Shouldn't
> the results be the same no matter which version I use to calculate the
> similarities?
>
> Thank you,
> Greg
>
