Yes, I don't know if removing that data would improve results. It might
mean you can compute things faster, with little or no observable loss in
the quality of the results.
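
For what it's worth, here is a minimal sketch of that kind of filtering
(the file names and the userID,itemID CSV layout are assumptions on my
part): two passes over the purchase log, dropping users and items at or
below the threshold before anything reaches the DataModel.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class FilterSparseRows {

  private static final int MIN_COUNT = 6;  // keep users/items with more than 5 purchases

  public static void main(String[] args) throws IOException {
    Map<String, Integer> userCounts = new HashMap<>();
    Map<String, Integer> itemCounts = new HashMap<>();

    // Pass 1: count purchases per user and per item.
    try (BufferedReader in = new BufferedReader(new FileReader("purchases.csv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split(",");
        userCounts.merge(fields[0], 1, Integer::sum);
        itemCounts.merge(fields[1], 1, Integer::sum);
      }
    }

    // Pass 2: keep only rows whose user and item both clear the threshold.
    try (BufferedReader in = new BufferedReader(new FileReader("purchases.csv"));
         PrintWriter out = new PrintWriter(new FileWriter("purchases-filtered.csv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split(",");
        if (userCounts.get(fields[0]) >= MIN_COUNT
            && itemCounts.get(fields[1]) >= MIN_COUNT) {
          out.println(line);
        }
      }
    }
  }
}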

I'm not sure, but you probably have repeat purchases of the same item, and
items of different value. Working that data in may help here, since you have
relatively few items.
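
If it helps, here is a rough sketch of the item-similarity setup described
below, assuming the purchase log has been aggregated into a
userID,itemID,count CSV (the file name and the item ID 42 are placeholders).
One caveat: LogLikelihoodSimilarity only looks at which user-item pairs
occur, so the counts or values would only change the scores if you swapped
in a value-aware similarity.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SimilarItemsSketch {
  public static void main(String[] args) throws Exception {
    // Assumed input: "userID,itemID,count", where count is how many times the
    // user bought the item (or a count weighted by item value).
    DataModel model = new FileDataModel(new File("purchases-aggregated.csv"));

    // Log-likelihood is based on co-occurrence only; the third column is
    // ignored by it, but keeping it lets you swap in a value-aware similarity.
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    // The 10 items most similar to item 42 (the item ID is just an example).
    List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);
    for (RecommendedItem item : similar) {
      System.out.println(item.getItemID() + "\t" + item.getValue());
    }
  }
}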


On Thu, Feb 14, 2013 at 10:25 AM, Julian Ortega <jorte...@gmail.com> wrote:

> Hi everyone.
>
> I have a data set that looks like this:
>
> Number of users: 198651
> Number of items: 9972
>
> Statistics of purchases from users
> --------------------------------
> mean number of purchases: 3.3
> stdDev number of purchases: 3.5
> min number of purchases: 1
> max number of purchases: 176
> median number of purchases: 2
>
> Statistics of purchased items
> --------------------------------
> mean number of times bought: 65.1
> stdDev number of times bought: 120.7
> min number of times bought: 1
> max number of times bought: 3278
> median number of times bought: 25
>
> I'm using a GenericItemBasedRecommender with LogLikelihoodSimilarity to
> generate a list of similar items. However, I've been wondering how I should
> pre-process the data before passing it to the recommender to improve the
> quality of the recommendations.
>
> Some things I have considered are:
>
>    - Removing all users that have 5 or fewer purchases
>    - Removing all items that have been purchased 5 or fewer times
>
> In general terms, would that make sense? Presumably it would make the matrix
> less sparse and also avoid weak associations, although, if I'm not mistaken,
> LogLikelihood already accounts for low numbers of occurrences.
>
> Any thoughts?
>
> Thanks,
> Julian
>
