I assume the parameter does not affect the possibleItemIDs because of the following line:
max = (int) Math.max(defaultMaxPrefsPerItemConsidered,
    userItemCountMultiplier * Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems())));

On Sun, Dec 4, 2011 at 2:59 PM, Daniel Zohar <disso...@gmail.com> wrote:

> Sean, your impl. is indeed better than mine but for some reason when I ran
> it for a user with a lot of interactions, I got 2023 possibleItemIDs
> (although I used 10,2 in the constructor).
>
> Sebastian, I will try and experiment also with your patch. I would just
> like to add that in my opinion, as long as 'killing items' has to be done
> manually, it is not scalable by definition. I personally would always
> prefer to avoid these kinds of solutions. Also, in my case, only 3% of the
> users have interacted with the most popular item, so I suppose that's not
> exactly the case as well.
>
>
> On Sun, Dec 4, 2011 at 2:30 PM, Sebastian Schelter <s...@apache.org> wrote:
>
>> Hi Daniel,
>>
>> My view is this: I think you can pretty safely down-sample power users
>> like it is done in https://issues.apache.org/jira/browse/MAHOUT-914
>> I did some experiments on the movielens1M dataset that showed that you
>> get a negligible error given you look at enough interactions per user:
>>
>> https://issues.apache.org/jira/secure/attachment/12506028/downsampling.png
>>
>> I could also verify this on the movielens10M dataset. I think this kind
>> of sampling works because the distribution of interactions with items in
>> the power-users and in the whole dataset is very similar. Therefore you
>> don't really learn anything new from the 'power-users'. The
>> 'power-users' might also be crawlers or people sharing accounts in
>> practice.
>>
>> However, I am not sure what happens when you also sample the number of
>> items you look at. If I had to decide, I'd rather follow Ted's advice
>> and kill super-popular items, as they are not helpful per-se.
>>
>> But if the additional item sampling helps in your usecase, I don't
>> oppose including it in Mahout. I think it's good to have a variety of
>> candidate item strategies. You should however do some experimenting to
>> see how much the sampling affects quality. An A/B test in a real
>> application would be the best thing to do.
>>
>> --sebastian
>>
>>
>> On 04.12.2011 13:12, Daniel Zohar wrote:
>> > Actually I was referring to Sebastian's. I haven't seen that you
>> > committed anything to SamplingCandidateItemsStrategy. Can you tell me
>> > in which class the change appears?
>> >
>> > On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <sro...@gmail.com> wrote:
>> >
>> >> Are you referring to my patch, MAHOUT-910?
>> >>
>> >> It does let you specify a hard cap, really -- if you place a limit of X,
>> >> then at most X^2 item-item associations come out. Before you could not
>> >> bound the result, really, since one user could rate a lot of items.
>> >>
>> >> I think it's slightly more efficient and unbiased as users with few ratings
>> >> will not have their ratings sampled out, and all users are equally likely
>> >> to be sampled out.
>> >>
>> >> What do you think?
>> >> Yes you could easily add a secondary cap though as a final filter.
>> >>
>> >> On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <disso...@gmail.com> wrote:
>> >>
>> >>> Combining the latest commits with my
>> >>> optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1)
>> >>> I achieved satisfying results. All the queries were under one second.
>> >>>
>> >>> Sebastian, I took a look at your patch and I think it's more practical than
>> >>> the current SamplingCandidateItemsStrategy, however it still doesn't put a
>> >>> strict cap on the number of possible item IDs like my implementation does.
>> >>> Perhaps there is room for both implementations?
>> >>>
>> >>>
>> >>> On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <s...@apache.org> wrote:
>> >>>
>> >>>> I created a jira to supply a non-distributed counterpart of the
>> >>>> sampling that is done in the distributed item similarity computation:
>> >>>>
>> >>>> https://issues.apache.org/jira/browse/MAHOUT-914
>> >>>>
>> >>>>
>> >>>> 2011/12/2 Sean Owen <sro...@gmail.com>:
>> >>>>> For your purposes, it's LogLikelihoodSimilarity. I made similar changes in
>> >>>>> other files. Ideally, just svn update to get all recent changes.
>> >>>>>
>> >>>>> On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <disso...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Sean, can you tell me which files you have committed the changes to?
>> >>>>>> Thanks
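
P.S. To make the point about the max expression at the top of this message concrete, here is a minimal standalone sketch. It only re-evaluates that one expression for a few made-up dataset sizes; the class name MaxPrefsCap, the helper cap(), and the assumption that both constructor parameters are ints are mine for illustration and are not Mahout's API. Read this way, the first constructor argument acts as a lower bound on the computed value rather than a hard limit on possibleItemIDs.

```java
// Hypothetical helper: mirrors the quoted expression only, it is not Mahout code.
class MaxPrefsCap {

    // Assumes both constructor parameters are ints; the counts stand in for
    // dataModel.getNumUsers() and dataModel.getNumItems().
    static int cap(int defaultMaxPrefsPerItemConsidered, int userItemCountMultiplier,
                   int numUsers, int numItems) {
        return (int) Math.max(defaultMaxPrefsPerItemConsidered,
            userItemCountMultiplier * Math.log(Math.max(numUsers, numItems)));
    }

    public static void main(String[] args) {
        // Small data set: 2 * ln(100) is about 9.2, so the default of 10 wins.
        System.out.println(cap(10, 2, 100, 50));
        // Larger data sets: the logarithmic term takes over and the value grows.
        System.out.println(cap(10, 2, 1_000, 500));
        System.out.println(cap(10, 2, 1_000_000, 10_000));
    }
}
```

With the made-up sizes above, constructor arguments of 10,2 give 10, 13 and 27, so the value passed in is only the floor once the data set is large enough. How this per-item value relates to the total number of possibleItemIDs depends on the rest of the class, which I have not reproduced here.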