Ah, good catch. I will adjust that. I'm happy to make a new example for 'boolean' data, perhaps based on BookCrossing. It would just ignore the rating data.
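As a rough sketch of the two options discussed below (dropping the 0/10 "implicit" rows, or ignoring the rating values and keeping only the user/book association), something like the following could work. It is plain Java, not Mahout code: the class name ImplicitRatings, its helper methods, the sample rows, and the semicolon-separated, double-quoted line layout are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Mahout: shows two ways to treat 0/10 rows.
public class ImplicitRatings {

    /** One parsed row: user ID, book ID, and the rating on the 0-10 scale. */
    static final class Rating {
        final String user;
        final String isbn;
        final int value;

        Rating(String user, String isbn, int value) {
            this.user = user;
            this.isbn = isbn;
            this.value = value;
        }
    }

    /** Parses a row such as "1";"A";"0" (layout assumed, fields have no ';'). */
    public static Rating parse(String line) {
        String[] fields = line.split(";");
        return new Rating(unquote(fields[0]), unquote(fields[1]),
                Integer.parseInt(unquote(fields[2])));
    }

    private static String unquote(String s) {
        return s.trim().replace("\"", "");
    }

    /** Option 1: keep only explicit ratings, dropping rows whose value is 0. */
    public static List<Rating> explicitOnly(List<Rating> all) {
        List<Rating> kept = new ArrayList<>();
        for (Rating r : all) {
            if (r.value > 0) {
                kept.add(r);
            }
        }
        return kept;
    }

    /** Option 2: ignore the values; map each user to the set of books seen. */
    public static Map<String, Set<String>> asBoolean(List<Rating> all) {
        Map<String, Set<String>> byUser = new LinkedHashMap<>();
        for (Rating r : all) {
            byUser.computeIfAbsent(r.user, u -> new LinkedHashSet<>()).add(r.isbn);
        }
        return byUser;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from the real dataset.
        String[] lines = {
            "\"1\";\"A\";\"0\"",
            "\"1\";\"B\";\"8\"",
            "\"2\";\"A\";\"0\""
        };
        List<Rating> all = new ArrayList<>();
        for (String line : lines) {
            all.add(parse(line));
        }
        System.out.println("explicit ratings kept: " + explicitOnly(all).size());
        System.out.println("boolean view: " + asBoolean(all));
    }
}
```

The boolean view keeps every row, including the implicit ones, so no data is thrown away; the explicit-only view is smaller but keeps the 1-10 scale meaningful.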

On Wed, Mar 10, 2010 at 2:46 PM, <[email protected]> wrote:
> I think I found the explanation of the poor result and, maybe, the
> instability.
>
> More than 60% of the ratings are 0/10. This is what the publishers of this
> dataset call "implicit rating". It means that the book was read (or
> purchased) but not rated by the user.
>
> It seems that BookCrossingDataModel is not aware of that and just treats
> them as a rating of 0. It is therefore not surprising that the results are
> inconsistent.
> An obvious way to solve the problem would be to filter out these implicit
> ratings.
>
> It would be interesting as well to change all ratings to "0" and to
> consider all of them as implicit. There are so far no Mahout examples
> dedicated to recommendation based on binary data (user has bought the item
> or not), even though this seems to me like a more common problem than
> recommendation based on actual ratings.
>
>
> Quoting Sean Owen <[email protected]>:
>
>> I see the same variance, but I believe it's due to a small input size.
>> At the moment it's using only 5% of the total input, or about 50,000
>> ratings over 5,000 users. That's fairly small. From there, it's also
>> looking at only 5% of those users to form neighborhoods. These are
>> just too low, and I have increased the amount of data the evaluation
>> uses in a few ways, and get much more stable results.
>>
>> I also switched the algorithm it uses, since the average difference
>> was 4 out of 10, which is pretty poor. I think with more research one
>> could pick the optimal algorithm, but I just picked something that
>> worked a little better (< 3) for now.
>>
>> On Tue, Mar 9, 2010 at 6:30 PM, Sean Owen <[email protected]> wrote:
>> > I see, that definitely doesn't sound right. Let me run it myself
>> > tonight when I am home and see what I observe.
>> >
>> > On Tue, Mar 9, 2010 at 5:40 PM, <[email protected]> wrote:
>> >> I did not change anything from the example provided in mahout-example,
>> >> development version. It uses 5% for evaluation, which is 5000
>> >> instances. With such a test set size, the range should not be that
>> >> big. I suspect that there is something wrong somewhere.
