On 29.11.2011 11:52, Sean Owen wrote:
> Yeah, the non-distributed implementation returns NaN in this case, which
> is a bit of an abuse, since it is defined to be 0. In practice I have
> always thought that is the right thing to do, for consistency with other
> implementations, where no overlap means undefined similarity. You could
> argue it either way.
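
For illustration, the change under discussion (have the non-distributed
Tanimoto computation report NaN when two items share no users) might look
roughly like the sketch below. This is simplified, not the actual
TanimotoCoefficientSimilarity source; the method and parameter names are
invented:

    // Sketch: report "undefined" instead of 0 when two items share no users.
    static double tanimoto(int intersectionSize, int numPrefs1, int numPrefs2) {
      if (intersectionSize == 0) {
        return Double.NaN; // no overlap: undefined similarity, not 0
      }
      // Tanimoto coefficient: |A n B| / |A u B|
      int unionSize = numPrefs1 + numPrefs2 - intersectionSize;
      return (double) intersectionSize / unionSize;
    }
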
I created https://issues.apache.org/jira/browse/MAHOUT-902 for this. It
should make a nice starter issue to work on, so I labeled it with
MAHOUT_INTRO_CONTRIBUTE.

--sebastian

> On Tue, Nov 29, 2011 at 9:14 AM, Sebastian Schelter <[email protected]> wrote:
>
>> Hi Greg,
>>
>> Thank you for your time debugging this!
>>
>> Maybe we should simply make TanimotoCoefficientSimilarity return
>> Double.NaN in case of no overlap?
>>
>> --sebastian
>>
>> On 29.11.2011 06:28, Greg H wrote:
>>> Sorry for taking so long to reply, but I think I found where the
>>> problem is. After comparing the similarities more closely, I found
>>> that the problem wasn't with ItemSimilarityJob. The difference in the
>>> number of similarities being computed was simply because
>>> ItemSimilarityJob excludes the item-item similarities that equal zero.
>>> However, this causes a problem when using GenericItemBasedRecommender
>>> to make recommendations.
>>>
>>> The problem starts at line 345 of GenericItemBasedRecommender:
>>>
>>>     if (excludeItemIfNotSimilarToAll || !Double.isNaN(estimate)) {
>>>       average.addDatum(estimate);
>>>     }
>>>
>>> This works as expected if you use ItemSimilarityJob, because the
>>> estimate for dissimilar items will be NaN, and this causes the final
>>> average to be zero. However, if you use the non-distributed version to
>>> calculate the similarities, the estimate for dissimilar items will be
>>> zero, which will not necessarily cause the final average to be zero,
>>> thus still allowing the item to be used even though it is not similar
>>> to all of the user's preferred items. So either the non-distributed
>>> version needs to be modified to not store similarities equal to zero,
>>> or the code above needs to be changed to handle the case where the
>>> estimate is zero.
>>>
>>> After fixing this, calculating similarities with the distributed and
>>> non-distributed versions gives the same results in my experiments.
>>>
>>> Thanks,
>>> Greg
>>>
>>> On Sun, Nov 27, 2011 at 4:51 AM, Ted Dunning <[email protected]> wrote:
>>>
>>>> This is good advice, but in many cases positive ratings far outnumber
>>>> negative ones, so the negatives may have limited impact.
>>>>
>>>> You absolutely should produce a histogram to see what kinds of
>>>> ratings you are getting and how many users are producing them.
>>>>
>>>> You should also consider implicit feedback. This tends to be of much
>>>> higher value than ratings, for two reasons:
>>>>
>>>> 1) there is usually 100x more of it
>>>>
>>>> 2) you generally care more about what people do than what they say.
>>>> Implicit feedback is based on what people do, and ratings are what
>>>> they say. Inferring what people will do based on what they say is
>>>> more difficult than inferring what they will do based on what they do.
>>>>
>>>> On Sat, Nov 26, 2011 at 12:59 AM, Sebastian Schelter <[email protected]>
>>>> wrote:
>>>>
>>>>> A small tip on the side: for implicit data, you should also include
>>>>> "negative" ratings, as those still contain a lot of information
>>>>> about the user's taste and willingness to engage. No need to use
>>>>> only the 3+ ratings.
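
To make Greg's point above concrete: when excludeItemIfNotSimilarToAll is
set, a single NaN estimate poisons the running average and the candidate
item is dropped, while a 0.0 estimate merely lowers the average, so the
candidate survives. A tiny self-contained illustration (plain Java with
made-up values, not Mahout code):

    public class NaNVersusZero {
      public static void main(String[] args) {
        // Estimates for one candidate against a user's preferred items.
        double[] fromJob  = {0.8, Double.NaN, 0.6}; // distributed: no overlap -> NaN
        double[] inMemory = {0.8, 0.0, 0.6};        // non-distributed: no overlap -> 0.0
        System.out.println(average(fromJob));  // NaN     -> candidate is excluded
        System.out.println(average(inMemory)); // 0.4666… -> candidate survives
      }

      static double average(double[] estimates) {
        double sum = 0.0;
        for (double e : estimates) {
          sum += e; // a single NaN makes the whole sum (and average) NaN
        }
        return sum / estimates.length;
      }
    }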
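
And on Ted's histogram suggestion: a few lines of plain Java over the usual
"userID,itemID,rating" CSV are enough, no Mahout dependency needed (the
file name below is made up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.TreeMap;

    public class RatingHistogram {
      public static void main(String[] args) throws Exception {
        // Count how often each rating value occurs, sorted by rating.
        TreeMap<Double, Integer> counts = new TreeMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("ratings.csv"))) {
          String line;
          while ((line = in.readLine()) != null) {
            double rating = Double.parseDouble(line.split(",")[2]);
            counts.merge(rating, 1, Integer::sum);
          }
        }
        counts.forEach((rating, n) -> System.out.printf("%.1f : %d%n", rating, n));
      }
    }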
