Yeah, the non-distributed implementation returns NaN in this case, which
is a bit of an abuse, since the coefficient is defined to be 0 there. In
practice I have always thought that is the right thing to do, for
consistency with other implementations, where no overlap means undefined
similarity. You could argue it either way.
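
For reference, a minimal sketch of that convention (illustrative only,
not the Mahout source):

import java.util.HashSet;
import java.util.Set;

// Tanimoto coefficient = |intersection| / |union| over the sets of
// users who expressed a preference for each item. With no overlap the
// strict definition gives 0; returning NaN instead marks the
// similarity as "undefined".
public class TanimotoSketch {
  public static double tanimoto(Set<Long> a, Set<Long> b) {
    Set<Long> intersection = new HashSet<Long>(a);
    intersection.retainAll(b);
    if (intersection.isEmpty()) {
      return Double.NaN; // or 0.0 under the strict definition
    }
    int union = a.size() + b.size() - intersection.size();
    return (double) intersection.size() / union;
  }
}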

On Tue, Nov 29, 2011 at 9:14 AM, Sebastian Schelter <[email protected]> wrote:

> Hi Greg,
>
> Thank you for your time debugging this!
>
> Maybe we should simply make TanimotoCoefficientSimilarity return
> Double.NaN in case of no overlap?
>
> --sebastian
>
> On 29.11.2011 06:28, Greg H wrote:
> > Sorry for taking so long to reply, but I think I found where the
> > problem is. After comparing the similarities more closely, I found out
> > that the problem wasn't with ItemSimilarityJob. The difference in the
> > number of similarities being computed was simply because
> > ItemSimilarityJob excludes the item-item similarities that equal zero.
> > However, this causes a problem when using GenericItemBasedRecommender
> > to make recommendations.
> >
> > The problem starts at line 345 of GenericItemBasedRecommender:
> >
> > if (excludeItemIfNotSimilarToAll || !Double.isNaN(estimate)) {
> >   average.addDatum(estimate);
> > }
> >
> > This works as expected if you use ItemSimilarityJob, because the
> > estimate for dissimilar items will be NaN and this causes the final
> > average to be zero. However, if you use the non-distributed version to
> > calculate the similarities, the estimate for dissimilar items will be
> > zero, which will not necessarily cause the final average to be zero,
> > thus still allowing this item to be used even though it is not similar
> > to all of the user's preferred items. So either the non-distributed
> > version needs to be modified to not store similarities equaling zero,
> > or the code above needs to be changed to handle the case where the
> > estimate is zero.
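> >
> > To make the effect concrete, here is a simplified stand-in for the
> > averaging (hypothetical values, not the actual Mahout code):
> >
> > public class AverageSketch {
> >   public static void main(String[] args) {
> >     // A candidate item's similarity estimates to a user's three
> >     // preferred items, encoded as each code path produces them:
> >     double[] distributed    = { 0.9, 0.8, Double.NaN }; // zero pruned
> >     double[] nonDistributed = { 0.9, 0.8, 0.0 };        // zero stored
> >     System.out.println(average(distributed));    // 0.85 (NaN skipped)
> >     System.out.println(average(nonDistributed)); // 0.5666...
> >   }
> >
> >   // NaN estimates are skipped while stored zeros are averaged in,
> >   // so the two encodings rank the same candidate differently.
> >   static double average(double[] estimates) {
> >     double sum = 0.0;
> >     int count = 0;
> >     for (double estimate : estimates) {
> >       if (!Double.isNaN(estimate)) {
> >         sum += estimate;
> >         count++;
> >       }
> >     }
> >     return count == 0 ? Double.NaN : sum / count;
> >   }
> > }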
> >
> > After fixing this, calculating similarities using the distributed and
> > non-distributed versions gives the same results in my experiments.
> >
> > Thanks,
> > Greg
> >
> > On Sun, Nov 27, 2011 at 4:51 AM, Ted Dunning <[email protected]> wrote:
> >
> >> This is good advice, but in many cases the number of positive ratings
> >> far outnumbers the number of negatives, so the negatives may have
> >> limited impact.
> >>
> >> You absolutely should produce a histogram to see what kinds of ratings
> >> you are getting and how many users are producing them.
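> >>
> >> For example, a quick sketch (assumes comma-separated
> >> userID,itemID,rating lines, which is hypothetical; adjust the
> >> parsing to your data):
> >>
> >> import java.io.BufferedReader;
> >> import java.io.FileReader;
> >> import java.util.HashMap;
> >> import java.util.Map;
> >>
> >> // Counts how often each rating value occurs and how many ratings
> >> // each user produced.
> >> public class RatingHistogram {
> >>   public static void main(String[] args) throws Exception {
> >>     Map<String, Integer> byRating = new HashMap<String, Integer>();
> >>     Map<String, Integer> byUser = new HashMap<String, Integer>();
> >>     BufferedReader in = new BufferedReader(new FileReader(args[0]));
> >>     String line;
> >>     while ((line = in.readLine()) != null) {
> >>       String[] fields = line.split(",");
> >>       increment(byUser, fields[0]);
> >>       increment(byRating, fields[2]);
> >>     }
> >>     in.close();
> >>     System.out.println("rating histogram: " + byRating);
> >>     System.out.println("ratings per user: " + byUser);
> >>   }
> >>
> >>   static void increment(Map<String, Integer> m, String key) {
> >>     Integer old = m.get(key);
> >>     m.put(key, old == null ? 1 : old + 1);
> >>   }
> >> }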
> >>
> >> You should also consider implicit feedback. This tends to be of much
> >> higher value than ratings, for two reasons:
> >>
> >> 1) there is usually 100x more of it
> >>
> >> 2) you generally care more about what people do than what they say.
> >>  Implicit feedback is based on what people do and ratings are what they
> >> say.  Inferring what people will do based on what they say is more
> >> difficult than inferring what they will do based on what they do.
> >>
> >> On Sat, Nov 26, 2011 at 12:59 AM, Sebastian Schelter <[email protected]>
> >> wrote:
> >>
> >>> A small tip on the side: for implicit data, you should also include
> >>> "negative" ratings, as those still contain a lot of information about
> >>> the user's taste and willingness to engage. No need to use only the
> >>> 3+ ratings.
> >>>
> >>
> >
>
>
