Have a look at the patch attached to MAHOUT-910. I have not committed it yet so as to allow review. https://issues.apache.org/jira/browse/MAHOUT-910
The current implementation samples users. MAHOUT-914 samples items from users. MAHOUT-910 samples both. What's most ideal? I had supposed we want to sample both since you might have a lot of users for one rated item, or many items from each user. That would let you bound both.

On Sun, Dec 4, 2011 at 12:12 PM, Daniel Zohar <[email protected]> wrote:

> Actually I was referring to Sebastian's. I haven't seen you committed
> anything to SamplingCandidateItemsStrategy. Can you tell me in which class
> the change appears?
>
> On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <[email protected]> wrote:
>
> > Are you referring to my patch, MAHOUT-910?
> >
> > It does let you specify a hard cap, really -- if you place a limit of X,
> > then at most X^2 item-item associations come out. Before you could not
> > bound the result, really, since one user could rate a lot of items.
> >
> > I think it's slightly more efficient and unbiased as users with few ratings
> > will not have their ratings sampled out, and all users are equally likely
> > to be sampled out.
> >
> > What do you think?
> > Yes you could easily add a secondary cap though as a final filter.
> >
> > On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[email protected]> wrote:
> >
> > > Combining the latest commits with my
> > > optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1)
> > > I achieved satisfying results. All the queries were under one second.
> > >
> > > Sebastian, I took a look at your patch and I think it's more practical than
> > > the current SamplingCandidateItemsStrategy, however it still doesn't put a
> > > strict cap on the number of possible item IDs like my implementation does.
> > > Perhaps there is room for both implementations?
> > >
> > > On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[email protected]> wrote:
> > >
> > > > I created a jira to supply a non-distributed counterpart of the
> > > > sampling that is done in the distributed item similarity computation:
> > > >
> > > > https://issues.apache.org/jira/browse/MAHOUT-914
> > > >
> > > > 2011/12/2 Sean Owen <[email protected]>:
> > > > > For your purposes, it's LogLikelihoodSimilarity. I made similar changes in
> > > > > other files. Ideally, just svn update to get all recent changes.
> > > > >
> > > > > On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[email protected]> wrote:
> > > > >
> > > > >> Sean, can you tell me which files have you committed the changes to? Thanks
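
To make the "sample both" idea from the top of this thread concrete, here is a rough sketch of what capping both sides could look like. It is not the MAHOUT-910 patch and it does not use any Mahout classes: the class name, method names, and the two in-memory maps are assumptions made up for illustration. It caps the number of users considered per item and the number of items taken from each user, so one item yields at most maxUsersPerItem * maxItemsPerUser candidates (X^2 when both caps are X, which is one reading of the bound mentioned above).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: none of these names come from Mahout or from the patch.
final class TwoSidedSamplingSketch {

  // Returns at most `max` elements of `ids`, chosen uniformly at random.
  // Lists at or under the cap are returned whole, so small lists are never sampled out.
  static List<Long> sample(List<Long> ids, int max, Random random) {
    if (ids.size() <= max) {
      return new ArrayList<>(ids);
    }
    List<Long> shuffled = new ArrayList<>(ids);
    Collections.shuffle(shuffled, random);
    return new ArrayList<>(shuffled.subList(0, max));
  }

  // Collects candidate items co-rated with `itemID`, capping both the users
  // looked at for the item and the items taken from each of those users.
  static Set<Long> candidateItems(long itemID,
                                  Map<Long, List<Long>> usersByItem,
                                  Map<Long, List<Long>> itemsByUser,
                                  int maxUsersPerItem,
                                  int maxItemsPerUser,
                                  Random random) {
    Set<Long> candidates = new HashSet<>();
    List<Long> users = usersByItem.getOrDefault(itemID, Collections.emptyList());
    for (long userID : sample(users, maxUsersPerItem, random)) {
      List<Long> items = itemsByUser.getOrDefault(userID, Collections.emptyList());
      candidates.addAll(sample(items, maxItemsPerUser, random));
    }
    candidates.remove(itemID); // an item is not a candidate for itself
    return candidates;
  }

  public static void main(String[] args) {
    // Toy data: 50 users all rated item 0, plus 20 other items each.
    Map<Long, List<Long>> usersByItem = new HashMap<>();
    Map<Long, List<Long>> itemsByUser = new HashMap<>();
    usersByItem.put(0L, new ArrayList<>());
    for (long u = 1; u <= 50; u++) {
      usersByItem.get(0L).add(u);
      List<Long> items = new ArrayList<>();
      items.add(0L);
      for (long i = 1; i <= 20; i++) {
        items.add(u * 100 + i);
      }
      itemsByUser.put(u, items);
    }
    int cap = 5;
    Set<Long> candidates =
        candidateItems(0L, usersByItem, itemsByUser, cap, cap, new Random());
    System.out.println(candidates.size() + " candidates, bound is " + (cap * cap));
  }
}
```

A secondary cap on the final candidate set, as suggested in the thread, could be applied by passing the result through the same sample helper once more.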
