On 29.11.2011 11:52, Sean Owen wrote:
> Yeah, the non-distributed implementation returns NaN in this case, which
> is a bit of an abuse, since the coefficient is defined to be 0 there. In
> practice I have always thought that NaN is the right thing to return, for
> consistency with other implementations, where no overlap means undefined
> similarity. You could argue it either way.

I created https://issues.apache.org/jira/browse/MAHOUT-902 for this.
It should make a nice starter issue to work on, so I have labeled it with
MAHOUT_INTRO_CONTRIBUTE.
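
The ambiguity in a nutshell, for anyone picking the issue up (a minimal
illustrative sketch in plain Java, not Mahout's actual code):

// Tanimoto coefficient: |A ∩ B| / |A ∪ B|. With no overlap the value is
// mathematically 0, but returning Double.NaN instead signals "undefined
// similarity" to callers.
static double tanimoto(int intersectionSize, int unionSize) {
  if (intersectionSize == 0) {
    return Double.NaN; // the behavior under discussion; 0.0 is the alternative
  }
  return (double) intersectionSize / unionSize;
}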

--sebastian


> 
> On Tue, Nov 29, 2011 at 9:14 AM, Sebastian Schelter <[email protected]> wrote:
> 
>> Hi Greg,
>>
>> Thank you for your time debugging this!
>>
>> Maybe we should simply make TanimotoCoefficientSimilarity return
>> Double.NaN in case of no overlap?
>>
>> --sebastian
>>
>> On 29.11.2011 06:28, Greg H wrote:
>>> Sorry for taking so long to reply, but I think I found where the
>>> problem is. After comparing the similarities more closely, I found out
>>> that the problem wasn't with ItemSimilarityJob. The difference in the
>>> number of similarities being computed was simply because
>>> ItemSimilarityJob excludes the item-item similarities that equal zero.
>>> However, this causes a problem when using GenericItemBasedRecommender
>>> to make recommendations.
>>>
>>> The problem starts at line 345 of GenericItemBasedRecommender:
>>>
>>> if (excludeItemIfNotSimilarToAll || !Double.isNaN(estimate)) {
>>>   average.addDatum(estimate);
>>> }
>>>
>>> This works as expected if you use ItemSimilarityJob, because the
>>> estimate for dissimilar items will be NaN, which causes the final
>>> average to be zero. However, if you use the non-distributed version to
>>> calculate the similarities, the estimate for dissimilar items will be
>>> zero, which will not necessarily cause the final average to be zero,
>>> thus still allowing this item to be used even though it is not similar
>>> to all of the user's preferred items. So either the non-distributed
>>> version needs to be modified to not store similarities equal to zero,
>>> or the code above needs to be changed to handle the case where the
>>> estimate is zero.
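>>>
>>> For example, the second option could look something like this (an
>>> untested sketch, not a patch):
>>>
>>> // Treat a zero estimate like an undefined one, so dissimilar items
>>> // behave the same whether the similarities came from the distributed
>>> // or the non-distributed computation.
>>> double effectiveEstimate = estimate == 0.0 ? Double.NaN : estimate;
>>> if (excludeItemIfNotSimilarToAll || !Double.isNaN(effectiveEstimate)) {
>>>   average.addDatum(effectiveEstimate);
>>> }
>>>
>>> Note this would also drop estimates that just happen to be exactly
>>> zero, so the first option might be the cleaner fix.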
>>>
>>> After fixing this, calculating the similarities with the distributed
>>> and non-distributed versions gives the same results in my experiments.
>>>
>>> Thanks,
>>> Greg
>>>
>>> On Sun, Nov 27, 2011 at 4:51 AM, Ted Dunning <[email protected]> wrote:
>>>
>>>> This is good advice, but in many cases the number of positive ratings
>>>> far outnumbers the number of negatives, so the negatives may have
>>>> limited impact.
>>>>
>>>> You absolutely should produce a histogram to see what kinds of
>>>> ratings you are getting and how many users are producing them.
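>>>>
>>>> Something like the following is enough to get that histogram from a
>>>> Taste DataModel (a rough sketch; the file name is illustrative, and
>>>> the usual org.apache.mahout.cf.taste imports and exception handling
>>>> are omitted):
>>>>
>>>> DataModel model = new FileDataModel(new File("ratings.csv"));
>>>> Map<Float,Integer> counts = new TreeMap<Float,Integer>();
>>>> LongPrimitiveIterator userIDs = model.getUserIDs();
>>>> while (userIDs.hasNext()) {
>>>>   for (Preference pref : model.getPreferencesFromUser(userIDs.nextLong())) {
>>>>     Float rating = pref.getValue(); // one bucket per distinct rating value
>>>>     Integer seen = counts.get(rating);
>>>>     counts.put(rating, seen == null ? 1 : seen + 1);
>>>>   }
>>>> }
>>>> System.out.println(counts); // e.g. {1.0=..., 5.0=...}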
>>>>
>>>> You should also consider implicit feedback. This tends to be much
>>>> higher value than ratings for two reasons:
>>>>
>>>> 1) there is usually 100x more of it
>>>>
>>>> 2) you generally care more about what people do than what they say.
>>>> Implicit feedback is based on what people do, while ratings are what
>>>> they say. Inferring what people will do based on what they say is
>>>> more difficult than inferring what they will do based on what they do.
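>>>>
>>>> In Taste terms, that usually means dropping the rating values and
>>>> treating the data as boolean "user interacted with item" preferences,
>>>> e.g. (a sketch, assuming 'model' is your existing DataModel):
>>>>
>>>> DataModel booleanModel = new GenericBooleanPrefDataModel(
>>>>     GenericBooleanPrefDataModel.toDataMap(model));
>>>> ItemSimilarity similarity = new TanimotoCoefficientSimilarity(booleanModel);
>>>> Recommender recommender =
>>>>     new GenericItemBasedRecommender(booleanModel, similarity);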
>>>>
>>>> On Sat, Nov 26, 2011 at 12:59 AM, Sebastian Schelter <[email protected]>
>>>> wrote:
>>>>
>>>>> A small tip on the side: for implicit data, you should also include
>>>>> "negative" ratings, as those still contain a lot of information about
>>>>> the user's taste and willingness to engage. No need to use only the
>>>>> 3+ ratings.
