Hi Greg, thank you for taking the time to debug this!
Maybe we should simply make TanimotoCoefficientSimilarity return Double.NaN in case of no overlap? (Rough sketch at the bottom of this mail.)

--sebastian

On 29.11.2011 06:28, Greg H wrote:
> Sorry for taking so long to reply, but I think I found where the problem is.
> After comparing the similarities more closely, I found that the problem
> wasn't with ItemSimilarityJob. The difference in the number of similarities
> being computed was simply because ItemSimilarityJob excludes the item-item
> similarities that equal zero. However, this causes a problem when using
> GenericItemBasedRecommender to make recommendations.
>
> The problem starts at line 345 of GenericItemBasedRecommender:
>
>     if (excludeItemIfNotSimilarToAll || !Double.isNaN(estimate)) {
>       average.addDatum(estimate);
>     }
>
> This works as expected if you use ItemSimilarityJob, because the estimate
> for dissimilar items will be NaN, and this causes the final average to be
> zero. However, if you use the non-distributed version to calculate the
> similarities, the estimate for dissimilar items will be zero, which will
> not necessarily cause the final average to be zero, thus still allowing the
> item to be used even though it is not similar to all of the user's
> preferred items. So either the non-distributed version needs to be modified
> to not store similarities that equal zero, or the code above needs to be
> changed to handle the case where the estimate is zero.
>
> After fixing this, calculating the similarities with the distributed and
> the non-distributed version gives the same results in my experiments.
>
> Thanks,
> Greg
>
> On Sun, Nov 27, 2011 at 4:51 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> This is good advice, but in many cases the number of positive ratings far
>> outnumbers the number of negatives, so the negatives may have limited
>> impact.
>>
>> You absolutely should produce a histogram to see what kinds of ratings
>> you are getting and how many users are producing them.
>>
>> You should also consider implicit feedback. This tends to be much more
>> valuable than ratings, for two reasons:
>>
>> 1) there is usually 100x more of it
>>
>> 2) you generally care more about what people do than about what they say.
>> Implicit feedback is based on what people do, while ratings are what they
>> say. Inferring what people will do from what they say is more difficult
>> than inferring what they will do from what they do.
>>
>> On Sat, Nov 26, 2011 at 12:59 AM, Sebastian Schelter <s...@apache.org>
>> wrote:
>>
>>> A small tip on the side: for implicit data, you should also include
>>> "negative" ratings, as those still contain a lot of information about
>>> the user's taste and willingness to engage. No need to use only the
>>> 3+ ratings.
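
For concreteness, here is a rough sketch of the change I have in mind. This is hypothetical and not the actual Mahout implementation; the plain java.util.Set representation of each item's user set is just for illustration:

    import java.util.Set;

    final class TanimotoSketch {

      // Tanimoto coefficient = |X n Y| / |X u Y| over the sets of users
      // with a preference for each item. Returns Double.NaN when the sets
      // share no users, so that "no overlap" stays distinguishable from a
      // genuine similarity of zero.
      static double similarity(Set<Long> usersOfItemX, Set<Long> usersOfItemY) {
        int intersection = 0;
        for (Long user : usersOfItemX) {
          if (usersOfItemY.contains(user)) {
            intersection++;
          }
        }
        if (intersection == 0) {
          // No common users: report "undefined" instead of 0.0, so the
          // !Double.isNaN(estimate) check in GenericItemBasedRecommender
          // skips the pair, matching what ItemSimilarityJob effectively
          // does by not emitting zero similarities.
          return Double.NaN;
        }
        int union = usersOfItemX.size() + usersOfItemY.size() - intersection;
        return (double) intersection / union;
      }
    }

With that, the non-distributed path would feed NaN into the averaging code Greg quoted, just as the distributed path does, and excludeItemIfNotSimilarToAll would behave the same in both setups.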
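
On Ted's histogram tip: a minimal, self-contained sketch for eyeballing the rating distribution, assuming a userID,itemID,rating CSV file (the class name and file layout are just an example):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Map;
    import java.util.TreeMap;

    public class RatingHistogram {
      public static void main(String[] args) throws Exception {
        // Count how often each rating value occurs in a userID,itemID,rating file.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
          String[] tokens = line.split(",");
          String rating = tokens[2];
          Integer seen = counts.get(rating);
          counts.put(rating, seen == null ? 1 : seen + 1);
        }
        in.close();
        // Print "rating<TAB>count", sorted by rating value.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
          System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
      }
    }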