Re: Mahout performance issues

Sean Owen Sun, 04 Dec 2011 04:20:36 -0800

Have a look at the patch attached to MAHOUT-910. I have not committed it
yet so as to allow review.
https://issues.apache.org/jira/browse/MAHOUT-910


The current implementation samples users. MAHOUT-914 samples items from
users. MAHOUT-910 samples both.
Which of these is the best approach?

I had supposed we want to sample both since you might have a lot of users
for one rated item, or many items from each user. That would let you bound
both.
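
To make "sample both" concrete, here is a rough sketch of the idea. It is
not the MAHOUT-910 patch and does not use Mahout's API; the class name, the
method names, and the plain Map-based data layout are all made up for
illustration. It caps the users considered per item and the items taken from
each of those users, so the candidate set can never exceed
maxUsersPerItem * maxItemsPerUser (X^2 when both caps are X).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only; not part of Mahout.
final class TwoLevelSampling {

  // Returns the whole list if it is within the cap, otherwise a uniform
  // random subset of exactly 'max' elements. Short lists are never thinned.
  static <T> List<T> sample(List<T> source, int max, Random random) {
    if (source.size() <= max) {
      return new ArrayList<>(source);
    }
    List<T> copy = new ArrayList<>(source);
    Collections.shuffle(copy, random);
    return new ArrayList<>(copy.subList(0, max));
  }

  // Candidate items for 'itemID': sample the users who rated it, then sample
  // the items each of those users rated. Both fan-outs are bounded.
  static Set<Long> candidateItems(long itemID,
                                  Map<Long, List<Long>> usersByItem,
                                  Map<Long, List<Long>> itemsByUser,
                                  int maxUsersPerItem,
                                  int maxItemsPerUser,
                                  Random random) {
    Set<Long> candidates = new HashSet<>();
    List<Long> users = usersByItem.getOrDefault(itemID, Collections.emptyList());
    for (Long userID : sample(users, maxUsersPerItem, random)) {
      List<Long> items = itemsByUser.getOrDefault(userID, Collections.emptyList());
      candidates.addAll(sample(items, maxItemsPerUser, random));
    }
    // The item itself is not a candidate for its own neighbourhood.
    candidates.remove(itemID);
    return candidates;
  }
}
```

Sampling only users leaves the per-user item count unbounded, and sampling
only items leaves the per-item user count unbounded; with both caps in place
neither can blow up the result.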

On Sun, Dec 4, 2011 at 12:12 PM, Daniel Zohar <[email protected]> wrote:

> Actually I was referring to Sebastian's. I haven't seen that you committed
> anything to SamplingCandidateItemsStrategy. Can you tell me in which class
> the change appears?
>
> On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <[email protected]> wrote:
>
> > Are you referring to my patch, MAHOUT-910?
> >
> > It does let you specify a hard cap, really -- if you place a limit of X,
> > then at most X^2 item-item associations come out. Before you could not
> > bound the result, really, since one user could rate a lot of items.
> >
> > I think it's slightly more efficient and unbiased as users with few
> > ratings will not have their ratings sampled out, and all users are equally
> > likely to be sampled out.
> >
> > What do you think?
> > Yes you could easily add a secondary cap though as a final filter.
> >
> > On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <[email protected]> wrote:
> >
> > > Combining the latest commits with my
> > > optimized-SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1)
> > > I achieved satisfying results. All the queries were under one second.
> > >
> > > Sebastian, I took a look at your patch and I think it's more practical
> > > than the current SamplingCandidateItemsStrategy, however it still doesn't
> > > put a strict cap on the number of possible item IDs like my implementation
> > > does. Perhaps there is room for both implementations?
> > >
> > >
> > >
> > > On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <[email protected]> wrote:
> > >
> > > > I created a jira to supply a non-distributed counterpart of the
> > > > sampling that is done in the distributed item similarity computation:
> > > >
> > > > https://issues.apache.org/jira/browse/MAHOUT-914
> > > >
> > > >
> > > > 2011/12/2 Sean Owen <[email protected]>:
> > > > > For your purposes, it's LogLikelihoodSimilarity. I made similar
> > > > > changes in other files. Ideally, just svn update to get all recent
> > > > > changes.
> > > > >
> > > > > On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <[email protected]> wrote:
> > > > >
> > > > >> Sean, can you tell me which files you committed the changes to?
> > > > >> Thanks
> > > >
> > >
> >
>
