Sean, your implementation is indeed better than mine, but for some reason, when I ran it for a user with a lot of interactions, I got 2023 possibleItemIDs (although I used 10,2 in the constructor).
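(For readers following along: the "strict cap" Daniel describes could look roughly like the sketch below. This is a hypothetical illustration, not the actual pastebin code or Mahout's API; the class and method names are invented.)

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a candidate-items strategy that enforces a strict
// cap on the number of candidate item IDs, no matter how many items the
// user's neighbors have interacted with.
public class CappedCandidates {

  // Collect distinct candidate items from per-neighbor item lists, stopping
  // as soon as maxCandidates distinct IDs have been gathered.
  static Set<Long> collect(List<long[]> itemsPerNeighbor, int maxCandidates) {
    Set<Long> candidates = new LinkedHashSet<>();
    for (long[] items : itemsPerNeighbor) {
      for (long item : items) {
        candidates.add(item);
        if (candidates.size() >= maxCandidates) {
          return candidates; // hard cap reached
        }
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    List<long[]> neighbors = new ArrayList<>();
    neighbors.add(new long[] {1, 2, 3, 4});
    neighbors.add(new long[] {3, 4, 5, 6, 7, 8});
    Set<Long> c = collect(neighbors, 5);
    System.out.println(c.size()); // never exceeds the cap of 5
  }
}
```

The point of a cap like this is that query latency becomes bounded by `maxCandidates` rather than by how active a user's neighbors are.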
Sebastian, I will also try to experiment with your patch. I would just like to add that, in my opinion, as long as 'killing items' has to be done manually, it is not scalable by definition. I personally would always prefer to avoid this kind of solution. Also, in my case, only 3% of users have interacted with the most popular item, so I suppose that's not exactly the case here either.

On Sun, Dec 4, 2011 at 2:30 PM, Sebastian Schelter <s...@apache.org> wrote:
> Hi Daniel,
>
> My view is this: I think you can pretty safely down-sample power users
> like it is done in https://issues.apache.org/jira/browse/MAHOUT-914
> I did some experiments on the movielens1M dataset that showed that you
> get a negligible error given you look at enough interactions per user:
>
> https://issues.apache.org/jira/secure/attachment/12506028/downsampling.png
>
> I could also verify this on the movielens10M dataset. I think this kind
> of sampling works because the distribution of interactions with items in
> the power-users and in the whole dataset is very similar. Therefore you
> don't really learn anything new from the 'power-users'. The
> 'power-users' might also be crawlers or people sharing accounts in
> practice.
>
> However, I am not sure what happens when you also sample the number of
> items you look at. If I had to decide, I'd rather follow Ted's advice
> and kill super-popular items, as they are not helpful per se.
>
> But if the additional item sampling helps in your use case, I don't
> oppose including it in Mahout. I think it's good to have a variety of
> candidate item strategies. You should however do some experimenting to
> see how much the sampling affects quality. An A/B test in a real
> application would be the best thing to do.
>
> --sebastian
>
> On 04.12.2011 13:12, Daniel Zohar wrote:
> > Actually I was referring to Sebastian's. I haven't seen you commit
> > anything to SamplingCandidateItemsStrategy. Can you tell me in which
> > class the change appears?
> >
> > On Sun, Dec 4, 2011 at 2:06 PM, Sean Owen <sro...@gmail.com> wrote:
> >
> >> Are you referring to my patch, MAHOUT-910?
> >>
> >> It does let you specify a hard cap, really -- if you place a limit of X,
> >> then at most X^2 item-item associations come out. Before, you could not
> >> bound the result, really, since one user could rate a lot of items.
> >>
> >> I think it's slightly more efficient and unbiased, as users with few
> >> ratings will not have their ratings sampled out, and all users are
> >> equally likely to be sampled out.
> >>
> >> What do you think?
> >> Yes, you could easily add a secondary cap though as a final filter.
> >>
> >> On Sun, Dec 4, 2011 at 11:43 AM, Daniel Zohar <disso...@gmail.com> wrote:
> >>
> >>> Combining the latest commits with my
> >>> optimized SamplingCandidateItemsStrategy (http://pastebin.com/6n9C8Pw1)
> >>> I achieved satisfying results. All the queries were under one second.
> >>>
> >>> Sebastian, I took a look at your patch and I think it's more practical
> >>> than the current SamplingCandidateItemsStrategy; however, it still
> >>> doesn't put a strict cap on the number of possible item IDs like my
> >>> implementation does. Perhaps there is room for both implementations?
> >>>
> >>> On Sun, Dec 4, 2011 at 11:13 AM, Sebastian Schelter <s...@apache.org> wrote:
> >>>
> >>>> I created a jira to supply a non-distributed counterpart of the
> >>>> sampling that is done in the distributed item similarity computation:
> >>>>
> >>>> https://issues.apache.org/jira/browse/MAHOUT-914
> >>>>
> >>>> 2011/12/2 Sean Owen <sro...@gmail.com>:
> >>>>> For your purposes, it's LogLikelihoodSimilarity. I made similar
> >>>>> changes in other files. Ideally, just svn update to get all recent
> >>>>> changes.
> >>>>>
> >>>>> On Fri, Dec 2, 2011 at 6:43 PM, Daniel Zohar <disso...@gmail.com> wrote:
> >>>>>
> >>>>>> Sean, can you tell me which files have you committed the changes to?
> >>>> Thanks
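(Editor's note on the down-sampling being discussed: the idea of capping each user at X randomly chosen preferences, which in turn bounds the item-item associations a single user can generate at roughly X^2, can be sketched as below. This is an illustrative sketch under stated assumptions, not Mahout's actual MAHOUT-910/914 code.)

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of per-user preference down-sampling: users at or under the cap are
// kept intact (so they are never sampled out), while power users contribute
// only maxPrefsPerUser uniformly chosen preferences.
public class DownSample {

  static long[] sample(long[] itemIds, int maxPrefsPerUser, Random random) {
    if (itemIds.length <= maxPrefsPerUser) {
      return itemIds; // users with few ratings are left untouched
    }
    long[] copy = Arrays.copyOf(itemIds, itemIds.length);
    // Partial Fisher-Yates shuffle: the first maxPrefsPerUser slots end up
    // holding a uniform random sample of the user's items.
    for (int i = 0; i < maxPrefsPerUser; i++) {
      int j = i + random.nextInt(copy.length - i);
      long tmp = copy[i];
      copy[i] = copy[j];
      copy[j] = tmp;
    }
    return Arrays.copyOf(copy, maxPrefsPerUser);
  }

  public static void main(String[] args) {
    long[] powerUser = new long[5000];
    for (int i = 0; i < powerUser.length; i++) {
      powerUser[i] = i;
    }
    long[] sampled = sample(powerUser, 100, new Random(42));
    // A user capped at X = 100 preferences contributes at most
    // X * (X - 1) / 2 unordered item-item pairs, instead of ~12.5 million.
    System.out.println(sampled.length + " prefs, at most "
        + sampled.length * (sampled.length - 1) / 2 + " pairs");
  }
}
```

This is why the cap bounds similarity-computation cost: co-occurrence work is quadratic in each user's preference count, so capping the count caps the work.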