Hi everyone.

I have a data set that looks like this:

Number of users: 198651
Number of items: 9972

Statistics of purchases from users
--------------------------------
mean number of purchases:    3.3
stdDev number of purchases:  3.5
min number of purchases:     1
max number of purchases:     176
median number of purchases:  2

Statistics of purchased items
--------------------------------
mean number of times bought:    65.1
stdDev number of times bought:  120.7
min number of times bought:     1
max number of times bought:     3278
median number of times bought:  25

I'm using a GenericItemBasedRecommender with LogLikelihoodSimilarity to
generate a list of similar items. However, I've been wondering how I should
pre-process the data before passing it to the recommender to improve the
quality of the results.
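
For reference, here is roughly the setup I have. The file name
purchases.csv is a placeholder; the file is the usual userID,itemID CSV
layout, which FileDataModel treats as boolean preferences:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SimilarItems {
    public static void main(String[] args) throws Exception {
        // Boolean purchase data: one "userID,itemID" line per purchase.
        DataModel model = new FileDataModel(new File("purchases.csv"));

        // Log-likelihood similarity over item co-occurrence counts.
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);

        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);

        // The 10 items most similar to item 42.
        List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);
        for (RecommendedItem item : similar) {
            System.out.println(item.getItemID() + "\t" + item.getValue());
        }
    }
}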

Some things I have considered are (a rough sketch of the filtering follows
the list):

   - Removing all users that have 5 or fewer purchases
   - Removing all items that have been purchased 5 or fewer times
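
Concretely, I had in mind a simple two-pass filter over the raw CSV before
handing it to FileDataModel, something like the following (the class name
PurchaseFilter and the file names are made up for the example):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PurchaseFilter {
    public static void main(String[] args) throws IOException {
        int minCount = 6; // keep users/items with more than 5 purchases

        Map<String, Integer> userCounts = new HashMap<>();
        Map<String, Integer> itemCounts = new HashMap<>();
        List<String[]> rows = new ArrayList<>();

        // First pass: count purchases per user and per item.
        try (BufferedReader in = new BufferedReader(new FileReader("purchases.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(","); // userID,itemID
                rows.add(fields);
                userCounts.merge(fields[0], 1, Integer::sum);
                itemCounts.merge(fields[1], 1, Integer::sum);
            }
        }

        // Second pass: keep only rows where both counts pass the threshold.
        // Note this is a single pass, not iterated to a fixed point: dropping
        // items can push some users back under the threshold.
        try (PrintWriter out = new PrintWriter(new FileWriter("purchases-filtered.csv"))) {
            for (String[] fields : rows) {
                if (userCounts.get(fields[0]) >= minCount
                        && itemCounts.get(fields[1]) >= minCount) {
                    out.println(String.join(",", fields));
                }
            }
        }
    }
}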

In general terms, would that make sense? Presumably it would make the matrix
less sparse and also avoid weak associations, although, if I'm not mistaken,
log-likelihood already accounts for low occurrence counts.

Any thoughts?

Thanks,
Julian
