Sorry for taking so long to reply, but I think I found where the problem
is. After comparing the similarities more closely I found that the
problem wasn't with ItemSimilarityJob. The difference in the number of
similarities being computed was simply because ItemSimilarityJob
excludes item-item similarities that equal zero. However, this causes a
problem when GenericItemBasedRecommender is used to make
recommendations.
The problem starts at line 345 of GenericItemBasedRecommender:
if (excludeItemIfNotSimilarToAll || !Double.isNaN(estimate)) {
average.addDatum(estimate);
}
This works as expected if you use ItemSimilarityJob: the estimate for
dissimilar items will be NaN, which makes the final average NaN and
causes the item to be excluded. However, if you use the non-distributed
version to calculate the similarities, the estimate for dissimilar
items will be zero, which leaves the final average finite and thus
still allows the item to be used even though it is not similar to all
of the user's preferred items. So either the non-distributed version
needs to be modified to not store similarities that equal zero, or the
code above needs to be changed to handle the case where the estimate is
zero.
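For illustration, a minimal sketch of what the second option could look
like (this is only my sketch, not committed code; it assumes the
surrounding loop and the average variable from the code quoted above,
and treats an exactly-zero similarity the same as a missing one):

// Sketch only: treat an exactly-zero similarity like a missing (NaN)
// one, so the distributed and non-distributed similarity
// implementations lead to the same exclusion behavior.
boolean dissimilar = Double.isNaN(estimate) || estimate == 0.0;
if (excludeItemIfNotSimilarToAll) {
  // Adding NaN poisons the running average, so the item is dropped
  // downstream when its final estimate is NaN.
  average.addDatum(dissimilar ? Double.NaN : estimate);
} else if (!dissimilar) {
  average.addDatum(estimate);
}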
After fixing this, calculating the similarities with the distributed
and non-distributed versions gives the same results in my experiments.
Thanks,
Greg
On Sun, Nov 27, 2011 at 4:51 AM, Ted Dunning <[email protected]> wrote:
> This is good advice, but in many cases the number of positive ratings
> far outnumbers the number of negatives, so the negatives may have
> limited impact.
>
> You absolutely should produce a histogram to see what kinds of ratings you
> are getting and how many users are producing them.
>
> You should also consider implicit feedback. This tends to be much more
> valuable than ratings, for two reasons:
>
> 1) there is usually 100x more of it
>
> 2) you generally care more about what people do than what they say.
> Implicit feedback is based on what people do, and ratings are what they
> say. Inferring what people will do based on what they say is more
> difficult than inferring what they will do based on what they do.
>
> On Sat, Nov 26, 2011 at 12:59 AM, Sebastian Schelter <[email protected]>
> wrote:
>
> > A small tip on the side: for implicit data, you should also include
> > "negative" ratings, as those still contain a lot of information
> > about the user's taste and willingness to engage. There is no need
> > to use only the 3+ ratings.
> >
>
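
Regarding Ted's histogram suggestion above, here is a minimal sketch of
producing such a rating histogram with Mahout's Taste DataModel API
(the class name and the file ratings.csv are placeholders for your own
setup):

import java.io.File;
import java.util.Map;
import java.util.TreeMap;

import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class RatingHistogram {
  public static void main(String[] args) throws Exception {
    // ratings.csv holds userID,itemID,rating triples
    DataModel model = new FileDataModel(new File("ratings.csv"));
    Map<Float,Integer> histogram = new TreeMap<Float,Integer>();
    LongPrimitiveIterator userIDs = model.getUserIDs();
    while (userIDs.hasNext()) {
      PreferenceArray prefs =
          model.getPreferencesFromUser(userIDs.nextLong());
      for (int i = 0; i < prefs.length(); i++) {
        float rating = prefs.getValue(i);
        Integer count = histogram.get(rating);
        histogram.put(rating, count == null ? 1 : count + 1);
      }
    }
    // one line per distinct rating value: rating<TAB>count
    for (Map.Entry<Float,Integer> entry : histogram.entrySet()) {
      System.out.println(entry.getKey() + "\t" + entry.getValue());
    }
  }
}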