[email protected] wrote:
> I think I found the explanation of the poor result and, maybe, the 
> instability.
>
> More than 60% of the ratings are 0/10. This is what the publishers of this
> dataset call "implicit rating". It means that the book was read (or purchased)
> but not rated by the user.
>
> It seems that BookCrossingDataModel is not aware of that and just treats
> them as a rating of 0. It is therefore not surprising that the results are
> inconsistent.
> An obvious way to solve the problem would be to filter out these implicit
> ratings.
>
> It would be interesting as well to change all ratings to "0" and to consider
> all of them as implicit. There are so far no Mahout examples dedicated to
> recommendation based on binary data (user has bought item or not), even though
> this seems to me like a more common problem than recommendation based on
> actual ratings.
>   
You may want to look at GenericBooleanPrefDataModel,
GenericItemBasedRecommender, GenericBooleanPrefUserBasedRecommender
and the similarities that support boolean preferences, such as Tanimoto
and LogLikelihood.
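
To make the boolean idea concrete, here is a minimal sketch in plain Java,
with no Mahout dependency. It drops the rating values entirely, keeps only
"user has this book or not", and compares two users with the Tanimoto
coefficient (size of the intersection divided by size of the union). The
names BooleanPrefSketch, toBooleanPrefs and tanimoto are made up for this
illustration and are not Mahout's API; Mahout's own classes may well differ
in the details.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not part of Mahout.
final class BooleanPrefSketch {

    // Collapse (userID, itemID, rating) triples into userID -> set of itemIDs.
    // The rating value is ignored, so an implicit 0 and an explicit 8 both
    // simply mean "this user has this item".
    static Map<Long, Set<Long>> toBooleanPrefs(long[][] triples) {
        Map<Long, Set<Long>> prefs = new HashMap<>();
        for (long[] t : triples) {
            prefs.computeIfAbsent(t[0], k -> new HashSet<>()).add(t[1]);
        }
        return prefs;
    }

    // Tanimoto coefficient: |A intersect B| / |A union B|.
    // Returns 0.0 when both sets are empty (an assumption made here to
    // avoid dividing by zero).
    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        // Iterate over the smaller set and probe the larger one.
        Set<Long> smaller = a.size() <= b.size() ? a : b;
        Set<Long> larger = (smaller == a) ? b : a;
        int intersection = 0;
        for (Long item : smaller) {
            if (larger.contains(item)) {
                intersection++;
            }
        }
        int union = a.size() + b.size() - intersection;
        return (double) intersection / union;
    }

    public static void main(String[] args) {
        // Made-up sample data: {userID, itemID, rating}.
        long[][] triples = {
            {1, 10, 0}, {1, 11, 8}, {1, 12, 0},
            {2, 11, 0}, {2, 12, 5}, {2, 13, 0},
        };
        Map<Long, Set<Long>> prefs = toBooleanPrefs(triples);
        System.out.println(tanimoto(prefs.get(1L), prefs.get(2L)));
    }
}
```

In the sample data the two users share two books out of four distinct ones,
so they come out as fairly similar even though most of their ratings are
implicit zeros.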



> Quoting Sean Owen <[email protected]>:
>
>   
>> I see the same variance, but I believe it's due to a small input size.
>> At the moment it's using only 5% of the total input, or about 50,000
>> ratings over 5,000 users. That's fairly small. From there, it's also
>> looking at only 5% of those users to form neighborhoods. These are
>> just too low, and I have increased the amount of data the evaluation
>> uses in a few ways, and get much more stable results.
>>
>> I also switched the algorithm it uses, since the average difference
>> was 4 out of 10, which is pretty poor. I think with more research one
>> could pick the optimal algorithm, but I just picked something that
>> worked a little better (< 3) for now.
>>
>> On Tue, Mar 9, 2010 at 6:30 PM, Sean Owen <[email protected]> wrote:
>>     
>>> I see, that definitely doesn't sound right. Let me run it myself
>>> tonight when I am home and see what I observe.
>>>
>>> On Tue, Mar 9, 2010 at 5:40 PM, <[email protected]> wrote:
>>>       
>>>> I did not change anything from the example provided in mahout-example,
>>>> development version. It uses 5% for evaluation, which is 5000 instances.
>>>> With such a test set size, the range should not be that big. I suspect
>>>> that there is something wrong somewhere.


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
[email protected] http://www.tis.bz.it
