Hi Greg,

Can you give an example of two items that were similar in the non-distributed case but did not appear in the distributed version?
A small tip on the side: for implicit data, you should also include the "negative" ratings, as those still carry a lot of information about the user's taste and willingness to engage. No need to use only the 3+ ratings. (Two short sketches follow below the quoted mails.)

--sebastian

On 25.11.2011 09:27, Greg H wrote:
> Hi Sebastian,
>
> I converted the dataset by simply keeping all user/item pairs that had a
> rating of above 3. I'm also using GenericItemBasedRecommender's
> mostSimilarItems method instead of the recommend method to make
> recommendations.
>
> I'm certainly open to suggestions on better evaluation metrics. I'm just
> using the top 5 because it was easy to implement.
>
> Thanks,
> Greg
>
> On Fri, Nov 25, 2011 at 4:03 PM, Sebastian Schelter <[email protected]> wrote:
>
>> Hi Greg,
>>
>> You should get the same results. Can you describe exactly how you
>> converted the dataset? I'd like to try this myself, maybe you found some
>> subtle bug.
>>
>> I also have doubts whether taking the precision of the top 5 recommended
>> items is really a good quality measure.
>>
>> --sebastian
>>
>> On 25.11.2011 02:41, Greg H wrote:
>>> Thanks for the replies, Sebastian and Sean. I looked at the similarity
>>> values and they are the same, but ItemSimilarityJob is calculating fewer
>>> of them, so it must still be doing some sort of sampling. I thought that
>>> I could force it to use all of the data by setting maxPrefsPerUser
>>> sufficiently large. Could there be another reason for it not to
>>> calculate all of the similarity values?
>>>
>>> I also tried to use a smaller value of similarItemsPerItem, but this
>>> leads to worse results.
>>>
>>> Thanks again,
>>> Greg
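
PS: Here is roughly what I mean by keeping everything as boolean data, as a
sketch against the non-distributed Taste API. It also shows the
mostSimilarItems() call you described. The file name "ratings.csv" and the
item ID 42 are just placeholders for your data; I picked Tanimoto since the
rating values get dropped anyway:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class MostSimilarItemsSketch {
      public static void main(String[] args) throws Exception {
        // Load the raw ratings, then drop the rating values entirely:
        // every user/item pair becomes a boolean "interaction", so the
        // low ratings are kept as signal instead of being filtered out.
        DataModel ratings = new FileDataModel(new File("ratings.csv"));
        DataModel model = new GenericBooleanPrefDataModel(
            GenericBooleanPrefDataModel.toDataMap(ratings));

        // Tanimoto works directly on the boolean co-occurrence data.
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(
                model, new TanimotoCoefficientSimilarity(model));

        // Neighbors of a single item (ID 42 is a placeholder), not
        // per-user recommendations.
        List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 5);
        for (RecommendedItem item : similar) {
          System.out.println(item.getItemID() + "\t" + item.getValue());
        }
      }
    }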
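
On the evaluation side, instead of hand-rolling precision on the top 5 you
could let GenericRecommenderIRStatsEvaluator compute precision/recall at 5
over held-out per-user items. One caveat: it exercises recommend() rather
than mostSimilarItems(), so it measures the per-user recommendation path,
and it won't resolve my doubts about precision@5 as a quality measure, but
at least it is a standardized implementation. Again, "ratings.csv" is a
placeholder:

    import java.io.File;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.eval.IRStatistics;
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class PrecisionAtFiveSketch {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));

        RecommenderBuilder builder = new RecommenderBuilder() {
          @Override
          public Recommender buildRecommender(DataModel dataModel)
              throws TasteException {
            return new GenericItemBasedRecommender(
                dataModel, new TanimotoCoefficientSimilarity(dataModel));
          }
        };

        // Holds out each user's most-preferred items as the "relevant"
        // set, recommends 5 items from the rest, and averages
        // precision/recall over all users.
        GenericRecommenderIRStatsEvaluator evaluator =
            new GenericRecommenderIRStatsEvaluator();
        IRStatistics stats = evaluator.evaluate(builder, null, model, null, 5,
            GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
        System.out.println("precision@5 = " + stats.getPrecision()
            + "  recall@5 = " + stats.getRecall());
      }
    }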
