Hi everyone. I have a data set that looks like this:
Number of users: 198651
Number of items: 9972

Statistics of purchases from users
--------------------------------
mean number of purchases      3.3
stdDev number of purchases    3.5
min number of purchases       1
max number of purchases       176
median number of purchases    2

Statistics of purchased items
--------------------------------
mean number of times bought     65.1
stdDev number of times bought   120.7
min number of times bought      1
max number of times bought      3278
median number of times bought   25

I'm using a GenericItemBasedRecommender with LogLikelihoodSimilarity to generate a list of similar items. However, I've been wondering how I should pre-process the data before passing it to the recommender to improve the quality. Some things I have considered are:

- Removing all users that have 5 or fewer purchases
- Removing all items that have been purchased 5 or fewer times

In general terms, would that make sense? Presumably it would make the matrix less sparse and also avoid weak associations, although if I'm not mistaken, log-likelihood already accounts for low numbers of occurrences.

Any thoughts?

Thanks,
Julian
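P.S. For concreteness, here is a rough sketch of the filtering step I have in mind, in plain Java (the `FilterSketch` class, the `long[] {userId, itemId}` record layout, and the threshold of 5 are my own illustration, not Mahout API). It does a single counting pass and then drops purchases whose user or item falls at or below the threshold; note that removing users lowers the item counts, so one could repeat the pass until nothing more is removed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FilterSketch {

    // Hypothetical cutoff: keep only users/items with MORE than this
    // many purchases, matching the "5 or fewer" idea above.
    static final int MIN_PURCHASES = 5;

    // Each long[] is one purchase record: {userId, itemId}.
    static List<long[]> filter(List<long[]> purchases) {
        Map<Long, Integer> userCounts = new HashMap<>();
        Map<Long, Integer> itemCounts = new HashMap<>();
        for (long[] p : purchases) {
            userCounts.merge(p[0], 1, Integer::sum);
            itemCounts.merge(p[1], 1, Integer::sum);
        }
        // Single pass: counts are taken over the ORIGINAL data, so
        // this may need to be iterated until a fixed point is reached.
        return purchases.stream()
                .filter(p -> userCounts.get(p[0]) > MIN_PURCHASES
                          && itemCounts.get(p[1]) > MIN_PURCHASES)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<long[]> data = new ArrayList<>();
        // Toy data: users 1..6 each buy items 100..105
        // (6 purchases per user, each item bought 6 times -> all kept).
        for (long user = 1; user <= 6; user++) {
            for (long item = 100; item <= 105; item++) {
                data.add(new long[]{user, item});
            }
        }
        // User 7 buys item 100 once: user 7 is dropped (1 purchase),
        // while item 100 stays (7 purchases).
        data.add(new long[]{7, 100});

        List<long[]> kept = filter(data);
        System.out.println("kept=" + kept.size()); // 36 of 37 records survive
    }
}
```

The output of the filtered list could then be written back to the preferences file that FileDataModel reads, so the recommender itself stays unchanged.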