ItemSimilarityJob's results differ from non-distributed version

Greg H Wed, 23 Nov 2011 23:31:37 -0800

Hello,

I've been using Mahout's item-based recommender on several different
implicit datasets. At first I computed the item-item similarities by
passing an ItemSimilarity and a DataModel to the GenericItemSimilarity
class, but lately I've been using ItemSimilarityJob to calculate them on a
Hadoop cluster. However, I've found a significant difference in the
results of my experiments depending on which method I use to calculate the
similarities.


For example, when I use the public MovieLens 1M dataset (which I've
converted into an implicit dataset), calculating the similarities with:

ItemBasedRecommender recommender = new GenericItemBasedRecommender(
    dataModel,
    new GenericItemSimilarity(
        new TanimotoCoefficientSimilarity(dataModel), dataModel));

gives a precision of 0.25 when each user's data is split at a ratio of
80% - 20% and only the top 5 recommended items are considered. However,
when I compute the similarities with ItemSimilarityJob using the following
command:

hadoop jar mahout-core-0.6-SNAPSHOT-job.jar \
  org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
  -Dmapred.input.dir=ml1m -Dmapred.output.dir=ml1m-output \
  --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
  --maxSimilaritiesPerItem 10000 --maxPrefsPerUser 10000 --booleanData true

I only get a precision of 0.23. The results differ even more significantly
on some other datasets that I've been working with. I know that
ItemSimilarityJob prunes some items, so I've tried many different settings
for maxSimilaritiesPerItem and maxPrefsPerUser. Although this improves the
results, it still never matches the non-distributed version. Shouldn't the
results be the same no matter which version I use to calculate the
similarities?
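
To make the question concrete, here is a minimal sketch of the Tanimoto
coefficient on boolean data in plain Java. It is not Mahout's
implementation: the class name TanimotoSketch, the toy user IDs, and the
way a single preference is dropped are all assumptions made for
illustration. It only shows that removing even one preference before the
similarity is computed changes the coefficient, which is one way pruning
could plausibly lead to different results downstream.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TanimotoSketch {

    // Tanimoto coefficient between the sets of users who interacted with
    // two items: |A intersect B| / |A union B|.
    // Returns 0.0 when both sets are empty.
    static double tanimoto(Set<Long> usersOfA, Set<Long> usersOfB) {
        Set<Long> intersection = new HashSet<>(usersOfA);
        intersection.retainAll(usersOfB);
        int unionSize = usersOfA.size() + usersOfB.size() - intersection.size();
        return unionSize == 0 ? 0.0 : (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Toy data (hypothetical): users who have a preference for each item.
        Set<Long> usersOfA = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> usersOfB = new HashSet<>(Arrays.asList(3L, 4L, 5L));

        // Full data: intersection {3, 4}, union {1, 2, 3, 4, 5}.
        System.out.println("full:   " + tanimoto(usersOfA, usersOfB));

        // Hypothetical pruning step: user 4's preference for item B is dropped.
        Set<Long> prunedB = new HashSet<>(usersOfB);
        prunedB.remove(4L);

        // Pruned data: intersection {3}, union {1, 2, 3, 4, 5}.
        System.out.println("pruned: " + tanimoto(usersOfA, prunedB));
    }
}
```

With this toy data the coefficient drops from 2/5 to 1/5 after a single
preference is removed, so the ranking of similar items can shift as well.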

Thank you,
Greg
