Hello,

I've been using Mahout's item-based recommender on several implicit datasets. At first I computed the item-item similarities in memory by passing an ItemSimilarity and a DataModel to the GenericItemSimilarity constructor, but lately I've been using ItemSimilarityJob to calculate them on a Hadoop cluster. I've found that the results of my experiments differ significantly depending on which method I use to calculate the similarities.

For example, I took the public MovieLens 1M dataset and converted it into an implicit dataset. Calculating the similarities with:

    ItemBasedRecommender recommender = new GenericItemBasedRecommender(
        dataModel,
        new GenericItemSimilarity(
            new TanimotoCoefficientSimilarity(dataModel), dataModel));

gives 0.25 precision when I split each user's data 80%/20% and look at only the top 5 recommended items.

However, when I compute the similarities with ItemSimilarityJob using the following command:

    hadoop jar mahout-core-0.6-SNAPSHOT-job.jar \
      org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
      -Dmapred.input.dir=ml1m \
      -Dmapred.output.dir=ml1m-output \
      --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
      --maxSimilaritiesPerItem 10000 \
      --maxPrefsPerUser 10000 \
      --booleanData true

I only get 0.23 precision. The results differ even more on some of the other datasets I've been working with.

I know that ItemSimilarityJob prunes some items, so I've tried many different settings for maxSimilaritiesPerItem and maxPrefsPerUser. This improves the results, but they still never match the non-distributed version.

Shouldn't the results be the same no matter which version I use to calculate the similarities?

Thank you,
Greg
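
P.S. To be concrete about what I expect both code paths to compute, here is a minimal sketch of the Tanimoto coefficient on boolean data as I understand it: for two items, the number of users who have both, divided by the number of users who have at least one. This is plain Java, not Mahout code. The names TanimotoSketch and tanimoto are made up for illustration, and returning 0.0 when neither item has any users is my own assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative helper only; this class is hypothetical and not part of Mahout.
final class TanimotoSketch {

    // Tanimoto coefficient between two items, where each item is represented
    // by the set of IDs of the users who have a (boolean) preference for it:
    // |A intersect B| / |A union B|.
    static double tanimoto(Set<Long> usersOfA, Set<Long> usersOfB) {
        if (usersOfA.isEmpty() && usersOfB.isEmpty()) {
            // Assumption: with no users on either side, report no similarity.
            return 0.0;
        }
        Set<Long> intersection = new HashSet<>(usersOfA);
        intersection.retainAll(usersOfB);
        // Size of the union via inclusion-exclusion.
        int unionSize = usersOfA.size() + usersOfB.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

With users {1, 2, 3} for one item and {2, 3, 4} for the other, the items share 2 of 4 distinct users, so the value is 0.5. The value depends only on the two user sets, so any step that drops users from either set before this computation would change it.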