Sean, your implementation is indeed better than mine, but for some reason, when I ran it for a user with a lot of interactions, I got 2023 possibleItemIDs (although I used 10,2 in the constructor).
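(For readers following along: the "strict cap" Daniel describes could look roughly like the sketch below. This is a hypothetical illustration, not the actual pastebin code or Mahout's API; the class and method names are invented.)

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a candidate-items strategy that enforces a strict
// cap on the number of candidate item IDs, no matter how many items the
// user's neighbors have interacted with.
public class CappedCandidates {

  // Collect distinct candidate items from per-neighbor item lists, stopping
  // as soon as maxCandidates distinct IDs have been gathered.
  static Set<Long> collect(List<long[]> itemsPerNeighbor, int maxCandidates) {
    Set<Long> candidates = new LinkedHashSet<>();
    for (long[] items : itemsPerNeighbor) {
      for (long item : items) {
        candidates.add(item);
        if (candidates.size() >= maxCandidates) {
          return candidates; // hard cap reached
        }
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    List<long[]> neighbors = new ArrayList<>();
    neighbors.add(new long[] {1, 2, 3, 4});
    neighbors.add(new long[] {3, 4, 5, 6, 7, 8});
    Set<Long> c = collect(neighbors, 5);
    System.out.println(c.size()); // never exceeds the cap of 5
  }
}
```

The point of a cap like this is that query latency becomes bounded by `maxCandidates` rather than by how active a user's neighbors are.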
Sebastian, I will also try to experiment with your patch. I would just like to add that, in my opinion, as long as 'killing items' has to be done manually, it is not scalable by definition. I personally would always prefer to avoid this kind of solution. Also, in my case, only 3% of users have interacted with the most popular item, so I suppose that's not exactly the case here either.

On Sun, Dec 4, 2011 at 2:30 PM, Sebastian Schelter <s...@apache.org> wrote:
> Hi Daniel,
>
> My view is this: I think you can pretty safely down-sample power users
> like it is done in https://issues.apache.org/jira/browse/MAHOUT-914
> I did some experiments on the movielens1M dataset that showed that you
> get a negligible error given you look at enough interactions per user:
>
> https://issues.apache.org/jira/secure/attachment/12506028/downsampling.png
>
> I could also verify this on the movielens10M dataset. I think this kind
> of sampling works because the distribution of interactions with items in
> the power-users and in the whole dataset is very similar. Therefore you
> don't really learn anything new from the 'power-users'. The
> 'power-users' might also be crawlers or people sharing accounts in
> practice.
>
> However, I am not sure what happens when you also sample the number of
> items you look at. If I had to decide, I'd rather follow Ted's advice
> and kill super-popular items, as they are not helpful per se.
>
> But if the additional item sampling helps in your use case, I don't
> oppose including it in Mahout. I think it's good to have a variety of
> candidate item strategies. You should however do some experimenting to
> see how much the sampling affects quality. An A/B test in a real
> application would be the best thing to do.
>
> --sebastian
>
> On 04.12.2011 13:12, Daniel Zohar wrote:
> > Actually I was referring to Sebastian's. I haven't seen you commit
> > anything to SamplingCandidateItemsStrategy. Can you tell me in which
> > class the change appears?
> >
> > On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <sro...@gmail.com> wrote:
> >
> >> Are you referring to my patch, MAHOUT-910?
> >>
> >> It does let you specify a hard cap, really -- if you place a limit of X,
> >> then at most X^2 item-item associations come out. Before, you could not
> >> bound the result, really, since one user could rate a lot of items.
> >>
> >> I think it's slightly more efficient and unbiased, as users with few
> >> ratings will not have their ratings sampled out, and all users are
> >> equally likely to be sampled out.
> >>
> >> What do you think?
> >> Yes, you could easily add a secondary cap though as a final filter.
> >>
> >> On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <disso...@gmail.com> wrote:
> >>
> >>> Combining the latest commits with my
> >>> optimized SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1)
> >>> I achieved satisfying results. All the queries were under one second.
> >>>
> >>> Sebastian, I took a look at your patch and I think it's more practical
> >>> than the current SamplingCandidateItemsStrategy; however, it still
> >>> doesn't put a strict cap on the number of possible item IDs like my
> >>> implementation does. Perhaps there is room for both implementations?
> >>>
> >>> On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <s...@apache.org> wrote:
> >>>
> >>>> I created a jira to supply a non-distributed counterpart of the
> >>>> sampling that is done in the distributed item similarity computation:
> >>>>
> >>>> https://issues.apache.org/jira/browse/MAHOUT-914
> >>>>
> >>>> 2011/12/2 Sean Owen <sro...@gmail.com>:
> >>>>> For your purposes, it's LogLikelihoodSimilarity. I made similar
> >>>>> changes in other files. Ideally, just svn update to get all recent
> >>>>> changes.
> >>>>>
> >>>>> On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <disso...@gmail.com> wrote:
> >>>>>
> >>>>>> Sean, can you tell me which files have you committed the changes to?
> >>>> Thanks
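(Editor's note on the down-sampling being discussed: the idea of capping each user at X randomly chosen preferences, which in turn bounds the item-item associations a single user can generate at roughly X^2, can be sketched as below. This is an illustrative sketch under stated assumptions, not Mahout's actual MAHOUT-910/914 code.)

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of per-user preference down-sampling: users at or under the cap are
// kept intact (so they are never sampled out), while power users contribute
// only maxPrefsPerUser uniformly chosen preferences.
public class DownSample {

  static long[] sample(long[] itemIds, int maxPrefsPerUser, Random random) {
    if (itemIds.length <= maxPrefsPerUser) {
      return itemIds; // users with few ratings are left untouched
    }
    long[] copy = Arrays.copyOf(itemIds, itemIds.length);
    // Partial Fisher-Yates shuffle: the first maxPrefsPerUser slots end up
    // holding a uniform random sample of the user's items.
    for (int i = 0; i < maxPrefsPerUser; i++) {
      int j = i + random.nextInt(copy.length - i);
      long tmp = copy[i];
      copy[i] = copy[j];
      copy[j] = tmp;
    }
    return Arrays.copyOf(copy, maxPrefsPerUser);
  }

  public static void main(String[] args) {
    long[] powerUser = new long[5000];
    for (int i = 0; i < powerUser.length; i++) {
      powerUser[i] = i;
    }
    long[] sampled = sample(powerUser, 100, new Random(42));
    // A user capped at X = 100 preferences contributes at most
    // X * (X - 1) / 2 unordered item-item pairs, instead of ~12.5 million.
    System.out.println(sampled.length + " prefs, at most "
        + sampled.length * (sampled.length - 1) / 2 + " pairs");
  }
}
```

This is why the cap bounds similarity-computation cost: co-occurrence work is quadratic in each user's preference count, so capping the count caps the work.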