Re: Mahout performance issues

Daniel Zohar Fri, 02 Dec 2011 06:34:33 -0800

On Fri, Dec 2, 2011 at 1:53 PM, Sean Owen <[email protected]> wrote:

> On Fri, Dec 2, 2011 at 11:28 AM, Daniel Zohar <[email protected]> wrote:
>
> > I'm already capping it at 100. If this will be my last resort, I will
> > decrease it more :)
> >
>
> This just can't be... 100 item-item similarities takes milliseconds to
> compute. Something else is going on.
> I should make a JIRA to propose my own version of this filtering just to
> make sure we're talking about the same thing.
>
>
>
I limit only the possibleItemIDs in my altered version
of SamplingCandidateItemsStrategy.
Since TopItems.getTopItems() computes similarities between every previously
interacted item and the set of 'possibleItemIDs', you are correct only
when the user has a single interaction. However, if the user has made 20
interactions, we're talking about 2000 item-item similarities.
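
To make the arithmetic explicit, here is a minimal sketch of the count, assuming every previously interacted item is compared against every candidate item. The class and method names are hypothetical and not part of Mahout.

```java
// Hypothetical illustration only; this is not Mahout code.
class SimilarityCount {

    // Number of item-item similarity evaluations when each item the user
    // interacted with is compared against each candidate item.
    static long pairCount(int interactedItems, int candidateItems) {
        return (long) interactedItems * candidateItems;
    }

    public static void main(String[] args) {
        // A user with a single interaction and 100 candidates.
        System.out.println(pairCount(1, 100));   // prints 100
        // A user with 20 interactions and the same cap of 100 candidates.
        System.out.println(pairCount(20, 100));  // prints 2000
    }
}
```

So capping the candidates at 100 bounds the work per interacted item, not the total work per recommendation.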



> >
> > You know this code way better than I do, so perhaps I am missing something
> > here. But as I see it (and I tested it as well) the users data point
> > remains intact. That's because the preferenceFromUsers Set remains the same
> > while only preferenceForItems is optimized. The main reason it improves
> > performance is because of the bottleneck we diagnosed before -
> > `GenericBooleanPrefDataModel.getNumUsersWithPreferenceFor` which in turn
> > calls `FastIDSet.intersectionSize`. Now, if we know _for sure_ that a user
> > interacted with a single item only, what's the point of checking every time
> > if it had interacted with other items? (I hope I make myself clear)
> > Because in my data set, we have over 80% of users which had a single
> > interaction, it gives such a performance boost. (I believe this case might
> > be more common than one might think in web apps)
> >
>
> Let me propose a better way to address that bottleneck. I think the problem
> is that the intersection computation is dumb, and should really compute
> "larger.intersectionSize(smaller)".
>
> Try ending getNumUsersWithPreferenceFor() with:
>
>    return userIDs1.size() < userIDs2.size() ?
>        userIDs2.intersectionSize(userIDs1) :
>        userIDs1.intersectionSize(userIDs2);
>
> It won't produce the same speedup, but it's more correct than omitting this
> data just to get this effect. If it gets 80% of the speedup, that's a great
> win.
>

You nailed it! It improves performance dramatically. Without my last fix,
we're talking about 2s at most; with my fix, it didn't go over 0.5s!
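
For anyone following along, here is a rough sketch of the idea behind that change, written with plain java.util sets rather than FastIDSet. It assumes, as "larger.intersectionSize(smaller)" implies, that the cost is driven by the set being iterated while the other set is only probed with lookups. The names probeCount and IntersectionSketch are made up for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the suggested change; it uses java.util sets,
// not Mahout's FastIDSet.
class IntersectionSketch {

    // Counts common IDs by iterating over 'iterated' and probing 'probed'.
    // The number of lookups equals iterated.size().
    static int probeCount(Set<Long> probed, Set<Long> iterated) {
        int count = 0;
        for (long id : iterated) {
            if (probed.contains(id)) {
                count++;
            }
        }
        return count;
    }

    // Always iterates over the smaller set and probes the larger one.
    static int intersectionSize(Set<Long> a, Set<Long> b) {
        return a.size() < b.size() ? probeCount(b, a) : probeCount(a, b);
    }

    public static void main(String[] args) {
        Set<Long> popularItemUsers = new HashSet<>();
        for (long id = 0; id < 100_000; id++) {
            popularItemUsers.add(id);
        }
        Set<Long> rareItemUsers = new HashSet<>();
        rareItemUsers.add(5L);        // also in the large set
        rareItemUsers.add(200_000L);  // not in the large set

        // Two lookups instead of 100,000, whichever order the arguments come in.
        System.out.println(intersectionSize(popularItemUsers, rareItemUsers)); // prints 1
        System.out.println(intersectionSize(rareItemUsers, popularItemUsers)); // prints 1
    }
}
```

The result is the same in either argument order; only the number of lookups changes, which would matter most when one of the two items has very few users.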


I still don't see any problem with not including 'singleton users' inside
preferenceForItems as long as preferenceFromUsers stays intact. Can you
please elaborate more on the problem as you see it? I feel we're having
some kind of communication problem :P
