Hi Greg, thank you for taking the time to debug this!
Maybe we should simply make TanimotoCoefficientSimilarity return Double.NaN in case of no overlap? (Rough sketch at the bottom of this mail.)

--sebastian

On 29.11.2011 06:28, Greg H wrote:
> Sorry for taking so long to reply, but I think I found where the problem is.
> After comparing the similarities more closely, I found that the problem
> wasn't with ItemSimilarityJob. The difference in the number of similarities
> being computed was simply because ItemSimilarityJob excludes the item-item
> similarities that equal zero. However, this causes a problem when using
> GenericItemBasedRecommender to make recommendations.
>
> The problem starts at line 345 of GenericItemBasedRecommender:
>
>     if (excludeItemIfNotSimilarToAll || !Double.isNaN(estimate)) {
>       average.addDatum(estimate);
>     }
>
> This works as expected if you use ItemSimilarityJob, because the estimate
> for dissimilar items will be NaN, and this causes the final average to be
> zero. However, if you use the non-distributed version to calculate the
> similarities, the estimate for dissimilar items will be zero, which will
> not necessarily cause the final average to be zero, thus still allowing the
> item to be used even though it is not similar to all of the user's
> preferred items. So either the non-distributed version needs to be modified
> to not store similarities that equal zero, or the code above needs to be
> changed to handle the case where the estimate is zero.
>
> After fixing this, calculating the similarities with the distributed and
> the non-distributed version gives the same results in my experiments.
>
> Thanks,
> Greg
>
> On Sun, Nov 27, 2011 at 4:51 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> This is good advice, but in many cases the number of positive ratings far
>> outnumbers the number of negatives, so the negatives may have limited
>> impact.
>>
>> You absolutely should produce a histogram to see what kinds of ratings
>> you are getting and how many users are producing them.
>>
>> You should also consider implicit feedback. This tends to be much more
>> valuable than ratings, for two reasons:
>>
>> 1) there is usually 100x more of it
>>
>> 2) you generally care more about what people do than about what they say.
>> Implicit feedback is based on what people do, while ratings are what they
>> say. Inferring what people will do from what they say is more difficult
>> than inferring what they will do from what they do.
>>
>> On Sat, Nov 26, 2011 at 12:59 AM, Sebastian Schelter <s...@apache.org>
>> wrote:
>>
>>> A small tip on the side: for implicit data, you should also include
>>> "negative" ratings, as those still contain a lot of information about
>>> the user's taste and willingness to engage. No need to use only the
>>> 3+ ratings.
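
For concreteness, here is a rough sketch of the change I have in mind. This is hypothetical and not the actual Mahout implementation; the plain java.util.Set representation of each item's user set is just for illustration:

    import java.util.Set;

    final class TanimotoSketch {

      // Tanimoto coefficient = |X n Y| / |X u Y| over the sets of users
      // with a preference for each item. Returns Double.NaN when the sets
      // share no users, so that "no overlap" stays distinguishable from a
      // genuine similarity of zero.
      static double similarity(Set<Long> usersOfItemX, Set<Long> usersOfItemY) {
        int intersection = 0;
        for (Long user : usersOfItemX) {
          if (usersOfItemY.contains(user)) {
            intersection++;
          }
        }
        if (intersection == 0) {
          // No common users: report "undefined" instead of 0.0, so the
          // !Double.isNaN(estimate) check in GenericItemBasedRecommender
          // skips the pair, matching what ItemSimilarityJob effectively
          // does by not emitting zero similarities.
          return Double.NaN;
        }
        int union = usersOfItemX.size() + usersOfItemY.size() - intersection;
        return (double) intersection / union;
      }
    }

With that, the non-distributed path would feed NaN into the averaging code Greg quoted, just as the distributed path does, and excludeItemIfNotSimilarToAll would behave the same in both setups.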
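
On Ted's histogram tip: a minimal, self-contained sketch for eyeballing the rating distribution, assuming a userID,itemID,rating CSV file (the class name and file layout are just an example):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Map;
    import java.util.TreeMap;

    public class RatingHistogram {
      public static void main(String[] args) throws Exception {
        // Count how often each rating value occurs in a userID,itemID,rating file.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
          String[] tokens = line.split(",");
          String rating = tokens[2];
          Integer seen = counts.get(rating);
          counts.put(rating, seen == null ? 1 : seen + 1);
        }
        in.close();
        // Print "rating<TAB>count", sorted by rating value.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
          System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
      }
    }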