It's hard to say which of these has the most significant impact at this point. To summarize, we made basically three changes, which together give an enormous improvement in performance:

1. Limit the number of 'possibleItemIDs' returned by SamplingCandidateItemsStrategy
2. Optimize getNumUsersWithPreferenceFor() to use "larger.intersectionSize(smaller)" (see the sketch just below)
3. Exclude 'singleton users' from GenericBooleanPrefDataModel.preferenceForItems
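
To make sure we're talking about the same thing on 2, here is roughly what I mean. Treat it as a sketch rather than the actual patch: the helper class is made up, and it just assumes the usual FastByIDMap<FastIDSet> that maps an itemID to the set of users who interacted with it.

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;

// Sketch of change 2: same result as before, but the intersection is always
// computed as larger.intersectionSize(smaller), which was the fast direction here.
final class NumUsersSketch {

  private NumUsersSketch() { }

  static int getNumUsersWithPreferenceFor(FastByIDMap<FastIDSet> preferenceForItems,
                                          long itemID1,
                                          long itemID2) {
    FastIDSet users1 = preferenceForItems.get(itemID1);
    if (users1 == null) {
      return 0;
    }
    FastIDSet users2 = preferenceForItems.get(itemID2);
    if (users2 == null) {
      return 0;
    }
    // Call intersectionSize() on the larger set, passing the smaller one.
    return users1.size() >= users2.size()
        ? users1.intersectionSize(users2)
        : users2.intersectionSize(users1);
  }
}
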
I think we already agreed about 1 & 2 (correct me if I'm wrong). Regarding 3, if we had a Map that mapped itemID to numOfUsers, that might solve the problem and still give a great performance boost (see the rough sketch below the quoted thread). You can see in my previous results that this change cuts query time to a quarter or less (2s vs. 0.5s). What do you think?

On Fri, Dec 2, 2011 at 5:38 PM, Sean Owen <[email protected]> wrote:

> On Fri, Dec 2, 2011 at 2:33 PM, Daniel Zohar <[email protected]> wrote:
>
> > On Fri, Dec 2, 2011 at 1:53 PM, Sean Owen <[email protected]> wrote:
> >
> > > On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <[email protected]> wrote:
> > >
> > > > I'm already capping it at 100. If this will be my last resort, I will decrease it more :)
> > >
> > > This just can't be... 100 item-item similarities take milliseconds to compute. Something else is going on.
> > > I should make a JIRA to propose my own version of this filtering, just to make sure we're talking about the same thing.
> >
> > I limit only the possibleItemIDs in my altered version of SamplingCandidateItemsStrategy.
> > Since TopItems.getTopItems() computes similarities for every previously interacted item against the set of 'possibleItemIDs', you are correct only when the user has a single interaction. However, if the user has made 20 interactions, we're talking about 2000 item-item similarities.
>
> Sure, but why not limit those 2000 interactions to 100?
>
> I'm not sure whether "100" is the optimal number, but clearly, at smaller scale, this all runs quite fast and produces reasonable results. Say that, at some smaller scale, it considers on average X interactions. If you turn this down such that only X interactions are considered here, I don't see why you wouldn't get similar performance and quality.
>
> I was actually considering a patch that would simply apply the max both to the number of users sampled per item and to the number of items from each of those users. If you clamped it at 10, then each item you've rated would produce at most 10*10 interactions. We only limit one of those things now.
>
> Well... maybe with the change below it speeds up further, so that you can get away with more data in less time and it is much easier to find a good tradeoff.
>
> > You nailed it! It improves the performance enormously. Without my last fix, we're talking about 2s max; with my fix, it didn't go over 0.5s!
>
> ... but does this change alone produce a good speedup?
>
> > I still don't see any problem with not including 'singleton users' inside preferenceForItems as long as preferenceFromUsers stays intact. Can you please elaborate more on the problem as you see it? I feel we're having some kind of communication problem :P
>
> The calculations are just wrong then, since you don't have the right user counts per item. Methods that answer that question give the wrong answer; similarity metrics like log-likelihood give slightly wrong results too. At this point I don't think it's good to sacrifice correctness for an optimization when there seems to be (?) a way to have most of the speedup without any correctness problem.
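
And here is a rough sketch of the Map idea for 3 (untested, and the class and method names are placeholders). It assumes preferenceFromUsers still maps every userID, including singletons, to that user's itemIDs, so the counts stay exact even after singleton users are dropped from preferenceForItems.

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

// Sketch of the itemID -> numUsers idea: build exact per-item user counts from
// the full preferenceFromUsers data, so single-item user counts stay correct
// even when singleton users are excluded from preferenceForItems.
final class ItemUserCountSketch {

  private ItemUserCountSketch() { }

  static FastByIDMap<Integer> countUsersPerItem(FastByIDMap<FastIDSet> preferenceFromUsers) {
    FastByIDMap<Integer> numUsersPerItem = new FastByIDMap<Integer>();
    LongPrimitiveIterator userIDs = preferenceFromUsers.keySetIterator();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      FastIDSet itemIDs = preferenceFromUsers.get(userID);
      LongPrimitiveIterator items = itemIDs.iterator();
      while (items.hasNext()) {
        long itemID = items.nextLong();
        Integer count = numUsersPerItem.get(itemID);
        numUsersPerItem.put(itemID, count == null ? 1 : count + 1);
      }
    }
    return numUsersPerItem;
  }

  // The single-item count would then come from this map rather than from
  // preferenceForItems.get(itemID).size().
  static int getNumUsersWithPreferenceFor(FastByIDMap<Integer> numUsersPerItem, long itemID) {
    Integer count = numUsersPerItem.get(itemID);
    return count == null ? 0 : count;
  }
}
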
