Isn't the difference probably that the distributed job samples the data? I would not expect an exact match.
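
To make the sampling point concrete, here is a minimal sketch in plain Java (not Mahout code). It computes a Tanimoto coefficient over the sets of users who interacted with two items, then recomputes it after one preference has been dropped. The class name, method names, and user IDs are all made up for illustration, and dropping a single preference is only my assumed stand-in for whatever the distributed job actually samples or prunes.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TanimotoSketch {

    // Tanimoto (Jaccard) coefficient for boolean data:
    // |users of A intersect users of B| / |users of A union users of B|.
    static double tanimoto(Set<Integer> usersOfA, Set<Integer> usersOfB) {
        Set<Integer> intersection = new HashSet<>(usersOfA);
        intersection.retainAll(usersOfB);
        int unionSize = usersOfA.size() + usersOfB.size() - intersection.size();
        return unionSize == 0 ? 0.0 : (double) intersection.size() / unionSize;
    }

    // Hypothetical stand-in for sampling: returns a copy of the set with one
    // user's preference removed, as if it had been sampled away.
    static Set<Integer> dropUser(Set<Integer> users, int userId) {
        Set<Integer> copy = new HashSet<>(users);
        copy.remove(userId);
        return copy;
    }

    public static void main(String[] args) {
        // Made-up user IDs for two items.
        Set<Integer> itemA = Set.of(1, 2, 3, 4);
        Set<Integer> itemB = Set.of(3, 4, 5, 6);

        // All preferences: intersection {3, 4}, union of size 6.
        double full = tanimoto(itemA, itemB);

        // Same items after user 4's preference for item A is dropped:
        // intersection {3}, union still of size 6.
        double sampled = tanimoto(dropUser(itemA, 4), itemB);

        System.out.printf(Locale.ROOT, "full=%.3f sampled=%.3f%n", full, sampled);
        // prints "full=0.333 sampled=0.167"
    }
}
```

One dropped preference is enough to change the similarity here, so any sampling at all on the distributed side would shift the item-item similarities and, with them, the top-N lists and the precision.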

On Thu, Nov 24, 2011 at 7:51 AM, Sebastian Schelter <[email protected]> wrote:
> Another remark: Using a small amount of similarItemsPerItem (say 30-50)
> should give better results than using all.
>
> --sebastian
>
>
> On 24.11.2011 02:14, Greg H wrote:
> > Hello,
> >
> > I've been using Mahout's item-based recommender on several different
> > implicit datasets. First I computed the item-item similarities by just
> > passing in an ItemSimilarity and DataModel to the GenericItemSimilarity
> > class but lately I've been using ItemSimilarityJob to calculate them on a
> > Hadoop cluster. However, I've found that there is a significant
> > difference in the results of my experiments depending on which method I
> > use to calculate the similarities.
> >
> > For example, when I use the public MovieLens 1M dataset (which I've
> > converted into an implicit dataset), calculating the similarities with:
> >
> > ItemBasedRecommender recommender = new
> > GenericItemBasedRecommender(dataModel, new GenericItemSimilarity(new
> > TanimotoCoefficientSimilarity(dataModel), dataModel));
> >
> > gives 0.25 precision when splitting the data for each user at the ratio
> > of 80% - 20% and then looking at only the top 5 recommended items.
> > However, when I compute the similarities with ItemSimilarityJob using
> > the following command:
> >
> > hadoop jar mahout-core-0.6-SNAPSHOT-job.jar
> > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> > -Dmapred.input.dir=ml1m -Dmapred.output.dir=ml1m-output
> > --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT
> > --maxSimilaritiesPerItem 10000 --maxPrefsPerUser 10000 --booleanData true
> >
> > I only get 0.23 precision. The results differ even more significantly in
> > some other datasets that I've been working with. I know that
> > ItemSimilarityJob prunes some items so I've tried many different settings
> > for maxSimilaritiesPerItem and maxPrefsPerUser and although this improves
> > the results it still never matches the non-distributed version. Shouldn't
> > the results be the same no matter which version I use to calculate the
> > similarities?
> >
> > Thank you,
> > Greg
