My first guess, without knowing all the details, is that the
similarity metric returns a nonzero similarity for only a few pairs
of items, so the item-item connections are few and far between. That
means the estimated preference for a new item may be based on just
2 item-item similarities. The result is a weighted average, but if
an item happens to be similar to just 2 of the user's items, and
those 2 items were rated highly (10), then the estimate comes out
as 10.
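To make the arithmetic concrete, here is a minimal sketch of a
similarity-weighted average. It is not Mahout's actual code; the class
name, the method name and the sample numbers are all made up for
illustration. The user has rated five items, but the candidate item is
similar to only the two that were rated 10.

```java
final class SparseEstimateSketch {

    // Similarity-weighted average of the user's ratings.
    // NaN similarities are skipped; a similarity of 0.0 adds nothing
    // to either sum, so it has no influence on the result.
    static double estimate(double[] similarities, double[] ratings) {
        double weightedSum = 0.0;
        double totalSimilarity = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            double s = similarities[i];
            if (Double.isNaN(s)) {
                continue;
            }
            weightedSum += s * ratings[i];
            totalSimilarity += s;
        }
        // No usable similarity at all: there is nothing to estimate from.
        return totalSimilarity == 0.0 ? Double.NaN : weightedSum / totalSimilarity;
    }

    public static void main(String[] args) {
        // Hypothetical data: ratings the user gave to five items, and the
        // candidate item's similarity to each of those five items.
        double[] ratings      = {10.0, 10.0, 2.0, 1.0, 3.0};
        double[] similarities = { 0.5,  0.5, 0.0, 0.0, 0.0};
        System.out.println(estimate(similarities, ratings)); // prints 10.0
    }
}
```

The three low ratings never enter the average, so the estimate is a
"perfect" 10 even though the user's ratings overall are mixed.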

The fact that so many items get the same estimate supports that
conclusion. I suspect it's partly because your similarity metric
returns only a few distinct values, so the maths work out the same
for many items.
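As a rough illustration of why so few values come out, here is a
sketch of a Jaccard coefficient over genre sets, returning 0.0 when
either item has no genres, as in the setup quoted below. The genre
names, the sample items and all class and method names are invented
for the example; this is not the custom ItemSimilarity itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class GenreJaccardSketch {

    static Set<String> genres(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // |A intersect B| / |A union B|, or 0.0 if either item has no genre info.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Collects the distinct similarity values over all item pairs.
    static TreeSet<Double> distinctSimilarities(List<Set<String>> items) {
        TreeSet<Double> values = new TreeSet<>();
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                values.add(jaccard(items.get(i), items.get(j)));
            }
        }
        return values;
    }

    // Hypothetical catalogue: six items, one of them without genre info.
    static List<Set<String>> sampleItems() {
        return Arrays.asList(
                genres("Drama"),
                genres("Drama", "Comedy"),
                genres("Comedy"),
                genres("Action"),
                genres("Drama", "Action"),
                genres());
    }

    public static void main(String[] args) {
        // 15 item pairs, but only three distinct values: 0, 1/3 and 1/2.
        System.out.println(distinctSimilarities(sampleItems()));
    }
}
```

With one or two genres per item the coefficient can only take a
handful of small fractions, so many candidate items end up with
identical weights and therefore identical estimates.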

There are a number of ways you could hack around the issue but, if
I'm right, my diagnosis is that this is just not an effective
content-based similarity metric for you. The maths don't work out
well, and intuitively I don't think genre alone tells you much about
movie similarity.

For item similarity, Tanimoto returns NaN when there is no
intersection or union, that is, when both sets are empty. I agree
there is a bit of an asymmetry here, since for users you get NaN
when there is no intersection. Mathematically the answer in that
case could be 0, but in practice NaN is more consistent with how the
other similarity metrics represent "no relation at all".
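The two cases are easy to see in plain floating-point arithmetic.
This is only a sketch of the formula, not the code in
TanimotoCoefficientSimilarity; the class name, method name and ID
sets are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class TanimotoNaNSketch {

    // |A intersect B| / |A union B| computed in doubles, so 0/0 yields NaN.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Long> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / (double) union.size();
    }

    public static void main(String[] args) {
        Set<Long> empty = Collections.emptySet();
        Set<Long> first = new HashSet<>(Arrays.asList(1L, 2L));
        Set<Long> second = new HashSet<>(Arrays.asList(3L, 4L));

        // Both sets empty: 0.0 / 0.0, which is NaN.
        System.out.println(tanimoto(empty, empty));   // prints NaN
        // Disjoint but non-empty: 0.0 / 4.0, which is 0.0.
        System.out.println(tanimoto(first, second));  // prints 0.0
    }
}
```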

On Fri, Dec 31, 2010 at 11:37 AM, Ahmet Arslan wrote:
> Hello Mahout community,
>
> I have a custom ItemSimilarity.
>
> Items can have multiple genre info. And some items do not have genre info 
> available. This custom similarity uses Jaccard coefficient over two genre 
> sets. (A intersect B) / (A union B), so its range is 0 to 1.
>
> A is genre set of item1
> B is genre set of item2
>
> If one of the items does not have genre info, it returns 0.0.
>
> And user,item,pref triples contains pref values between 0.010416667 to 10.0
>
> I am using GenericItemBasedRecommender with the getAllOtherItems method overridden.
>
> I have custom IDRescorer that filters out some items. It does not do 
> rescoring.
>
> When I ask for recommendations for a user, I see that the top 65 items have the
> same estimated preference. Some results even have a perfect estimated pref
> value of 10.
> Is this normal to have same estimated pref values for so many items?
>
> For example:
> top 63 has score of 4.015476
> 63 to 96 has score of 3.9160492
> 97 has 3.8527777
> 98 to 100 has 3.472611
>
> Another question is, TanimotoCoefficientSimilarity never returns Double.NaN 
> for item similarity. However it does for user similarity. How to choose 
> between 0.0 and Double.NaN? What will be the difference in terms of estimated 
> pref?
>
> Thanks.
>
>
>
>
