Isn't the likely cause of the difference that the distributed job is
sampling the data? I would not expect an exact match.
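
To make the sampling point concrete, here is a minimal sketch of how dropping a single preference changes a Tanimoto coefficient. It is plain Java and does not use Mahout; the class name TanimotoSketch, the user IDs, and the choice of which preference is dropped are all made up for illustration, and it does not reproduce how ItemSimilarityJob actually samples.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class, not part of Mahout.
final class TanimotoSketch {

    // Tanimoto (Jaccard) coefficient between the sets of users who
    // expressed a preference for item A and for item B:
    // |A intersect B| / |A union B|.
    static double tanimoto(Set<Long> usersOfA, Set<Long> usersOfB) {
        Set<Long> intersection = new HashSet<>(usersOfA);
        intersection.retainAll(usersOfB);
        int union = usersOfA.size() + usersOfB.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        // Made-up user IDs for two items.
        Set<Long> itemA = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> itemB = new HashSet<>(Arrays.asList(3L, 4L, 5L, 6L));

        // Same item B after one preference (user 4) has been dropped,
        // standing in for whatever a sampling step might remove.
        Set<Long> sampledB = new HashSet<>(Arrays.asList(3L, 5L, 6L));

        System.out.println("full data:    " + tanimoto(itemA, itemB));
        System.out.println("sampled data: " + tanimoto(itemA, sampledB));
    }
}
```

With the full sets the coefficient is 2/6; with one preference removed it falls to 1/6. Shifts like this can reorder an item's nearest neighbours, which would be enough to move precision at 5 by a small amount.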

On Thu, Nov 24, 2011 at 7:51 AM, Sebastian Schelter <[email protected]> wrote:

> Another remark: Using a small number of similarItemsPerItem (say 30-50)
> should give better results than using all of them.
>
> --sebastian
>
>
> On 24.11.2011 02:14, Greg H wrote:
> > Hello,
> >
> > I've been using Mahout's item-based recommender on several different
> > implicit datasets. First I computed the item-item similarities by just
> > passing in an ItemSimilarity and DataModel to the GenericItemSimilarity
> > class but lately I've been using ItemSimilarityJob to calculate them on a
> > Hadoop cluster. However, I've found that there is a significant difference
> > in the results of my experiments depending on which method I use to
> > calculate the similarities.
> >
> > For example, when I use the public MovieLens 1M dataset (which I've
> > converted into an implicit dataset), calculating the similarities with:
> >
> > ItemBasedRecommender recommender = new
> > GenericItemBasedRecommender(dataModel, new GenericItemSimilarity(new
> > TanimotoCoefficientSimilarity(dataModel), dataModel));
> >
> > gives 0.25 precision when splitting the data for each user at the ratio of
> > 80% - 20% and then looking at only the top 5 recommended items. However,
> > when I compute the similarities with ItemSimilarityJob using the following
> > command:
> >
> > hadoop jar mahout-core-0.6-SNAPSHOT-job.jar
> > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
> > -Dmapred.input.dir=ml1m -Dmapred.output.dir=ml1m-output
> > --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT
> > --maxSimilaritiesPerItem 10000 --maxPrefsPerUser 10000 --booleanData true
> >
> > I only get 0.23 precision. The results differ even more significantly in
> > some other datasets that I've been working with. I know that
> > ItemSimilarityJob prunes some items so I've tried many different settings
> > for maxSimilaritiesPerItem and maxPrefsPerUser and although this improves
> > the results it still never matches the non-distributed version. Shouldn't
> > the results be the same no matter which version I use to calculate the
> > similarities?
> >
> > Thank you,
> > Greg
> >
>
>
