I will now try using the latest snapshot from http://svn.apache.org/repos/asf/mahout/trunk .

I would really prefer to avoid pre-computing the item similarities at the moment. Do you believe I can achieve good performance without it?
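(For reference, a minimal sketch of what pre-computing could look like, assuming Mahout's GenericItemSimilarity and its 0.5-era (similarity, dataModel, maxToKeep) constructor, which computes item-item similarities eagerly and keeps only the strongest pairs in memory. The maxToKeep value below is made up, and with 2M items a full precomputation is likely impractical without pruning first:)

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

DataModel dataModel = new FileDataModel(new File("prefs.csv")); // path is illustrative
// Pays the similarity cost once at startup; lookups at recommendation
// time become in-memory hits instead of log-likelihood computations.
int maxToKeep = 1000000; // hypothetical; bound this by available RAM
ItemSimilarity precomputed =
    new GenericItemSimilarity(new LogLikelihoodSimilarity(dataModel), dataModel, maxToKeep);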
Is there any specific pruning method you would recommend? I guess this is only relevant at recommendation time, as all the data is needed in order for all of my users to be able to get recommendations.

Lastly, as I wrote before, if a user chose an item in the past which many other users also chose (we have a few items with 100k-400k associated users), the recommendation is significantly slower (unless he chose only very few items). Maybe that's a hint about the bottleneck in the similarity computations?

On Wed, Nov 30, 2011 at 4:24 PM, Sean Owen <[email protected]> wrote:

> Yeah, I agree that using just a handful of candidates is far too few and that's not a solution. It should not be so slow even with a reasonable number of prefs and users.
>
> Multi-threading *is* a problem insofar as there is no multi-threading helping speed up your request. But that's a side issue.
>
> Definitely use the version in Subversion instead of 0.5; I think it will help directly. 5000 item-item similarities shouldn't take that long to compute.
>
> You can try pre-computing similarities as you mentioned earlier, if you can find more RAM.
>
> Of course, I always also recommend looking at pruning; often 90% of your data says very little and 10% carries most of the information. Figuring out which is which is hard! But it's often possible to drastically simplify things by finding some clever way of deciding what's not useful.
>
> And of course you can always start to look at Hadoop-based recommenders.
>
> On Wed, Nov 30, 2011 at 2:19 PM, Daniel Zohar <[email protected]> wrote:
>
> > Hi Sean,
> > First of all, let me thank you for all your help thus far :)
> >
> > I am using Mahout 0.5. At the moment the application is not live yet, so I assume multi-threading is not a problem for now.
> >
> > I definitely see that the bottleneck is in the similarity computations. Looking at TopItems.getTopItems, I can see that the method iterates over all the 'possible items' and evaluates them using the Estimator, which in turn iterates over all the past user choices for every possible item. Now let's assume a user chose 50 items before and has 100 possible items; that's already 5k item-item similarities to calculate. If I didn't cap the possible items, it could wind up at much larger numbers.
> >
> > I would also like to add that although the solution I posted before improves performance, it severely damages the quality of the recommendations, as it checks a smaller pool of possible items.
> >
> > Thanks!
> >
> > On Wed, Nov 30, 2011 at 3:47 PM, Sean Owen <[email protected]> wrote:
> >
> > > I have a few more thoughts.
> > >
> > > First, I was wrong about what the first parameter to SamplingCandidateItemsStrategy means. It's effectively a minimum, rather than a maximum; setting it to 1 just means it will sample at least 1 pref. I think you figured that out. I think values like (5,1) are probably about right for you.
> > >
> > > I see that your change is to further impose a global cap on the number of candidate items returned. I understand the logic of that -- Sebastian, what do you think? (PS: you can probably make that run slightly faster by using LongPrimitiveIterator instead of Iterator<Long>.)
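(A quick sketch of that PS, assuming the candidate set is Mahout's FastIDSet: its primitive iterator hands back raw longs, so a hot loop avoids boxing every ID into a Long:)

import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

FastIDSet possibleItemIDs = new FastIDSet(); // in practice, the candidates being capped

// Instead of iterating via Iterator<Long>, which allocates a Long per ID:
LongPrimitiveIterator it = possibleItemIDs.iterator();
while (it.hasNext()) {
  long itemID = it.nextLong(); // primitive long, no boxing
  // ... apply the global cap / filtering to itemID ...
}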
> > > Something still feels a bit off here; that's a very long time. Your JVM params are impeccable, and you have a good amount of RAM and a strong machine.
> > >
> > > Since you're getting speed-up by directly reducing the number of candidate items, I get the idea that your similarity computations are the bottleneck. Does any of your profiling confirm that?
> > >
> > > Are you using the latest code? I can think of one change in the last few months that I added (certainly since 0.5) that would speed up LogLikelihoodSimilarity a fair bit. I know yours is 'boolean' data, so this ought to be very fast.
> > >
> > > I'll also say that the computation here is not multi-threaded. I had always sort of thought that, at scale, you'd be getting parallelism from handling multiple concurrent requests. It would be possible to rewrite a lot of the internals to compute top recs using multiple threads. That might make individual requests return faster on a multi-core machine, though it wouldn't increase overall throughput.
> > >
> > > On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <[email protected]> wrote:
> > >
> > > > Hello all,
> > > > This email follows the correspondence on Stack Overflow between myself and Sean Owen. Please see http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues
> > > >
> > > > I'm building a boolean-based recommendation engine with the following data:
> > > >
> > > > - 12M users
> > > > - 2M items
> > > > - 18M user-item (boolean) choices
> > > >
> > > > The following code is used to build the recommender:
> > > >
> > > > DataModel dataModel = new FileDataModel(new File(dataFile));
> > > > ItemSimilarity itemSimilarity = new CachingItemSimilarity(new LogLikelihoodSimilarity(dataModel), dataModel);
> > > > CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(20, 5);
> > > > MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(20, 5);
> > > > this.recommender = new GenericBooleanPrefItemBasedRecommender(dataModel, itemSimilarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);
> > > >
> > > > My app runs on Tomcat with the following JVM arguments: *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC -XX:+UseParallelOldGC*
> > > >
> > > > Recommendations with the code above work very well for users who have made 1-2 choices in the past, but can take over a minute when a user has made tens of choices, especially if one of those choices is a very popular item (i.e. was chosen by many other users).
> > > >
> > > > Even when using the *SamplingCandidateItemsStrategy* with (1,1) arguments, I still did not manage to achieve fast results.
> > > >
> > > > The only way I managed to get somewhat OK results (max recommendation time ~4 secs) was by rewriting the *SamplingCandidateItemsStrategy* so that *doGetCandidateItems* returns a limited number of items. Following is the doGetCandidateItems method as I re-wrote it: http://pastebin.com/6n9C8Pw1
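(In case the pastebin link rots, here is a minimal sketch of the same idea -- not the exact code posted -- assuming the 0.5-era doGetCandidateItems(long[], DataModel) contract of AbstractCandidateItemsStrategy; MAX_ITEMS is a made-up cap:)

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

private static final int MAX_ITEMS = 2000; // hypothetical global cap

@Override
protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel)
    throws TasteException {
  FastIDSet possibleItemIDs = new FastIDSet();
  for (long itemID : preferredItemIDs) {
    PreferenceArray prefs = dataModel.getPreferencesForItem(itemID);
    // Walk the users who also chose this item, but stop growing the candidate
    // set once the global cap is hit, so 100k-user items can't dominate it.
    for (int i = 0; i < prefs.length() && possibleItemIDs.size() < MAX_ITEMS; i++) {
      possibleItemIDs.addAll(dataModel.getItemIDsFromUser(prefs.getUserID(i)));
    }
    if (possibleItemIDs.size() >= MAX_ITEMS) {
      break;
    }
  }
  possibleItemIDs.removeAll(preferredItemIDs); // never recommend already-chosen items
  return possibleItemIDs;
}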
> > > >
> > > > **I think a good response time for recommendations should be less than a second (preferably less than 500 milliseconds).**
> > > >
> > > > How can I make Mahout perform better? I have a feeling some optimization is needed both in the *CandidateItemsStrategy* and the *Recommender* itself.
> > > >
> > > > Thanks in advance!
> > > > Daniel
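(On Sean's pruning suggestion above, one crude but concrete option is to trim the input file before FileDataModel ever loads it, dropping long-tail items chosen by fewer than a handful of users. A sketch only; file names and the threshold are made up:)

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public final class PrunePrefs {
  public static void main(String[] args) throws IOException {
    String in = "prefs.csv";        // hypothetical: one "userID,itemID" row per choice
    String out = "prefs-pruned.csv";
    int minUsersPerItem = 2;        // hypothetical threshold

    // Pass 1: count how many users chose each item.
    Map<Long, Integer> counts = new HashMap<Long, Integer>();
    BufferedReader reader = new BufferedReader(new FileReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      if (line.isEmpty()) continue;
      long itemID = Long.parseLong(line.split(",")[1]);
      Integer c = counts.get(itemID);
      counts.put(itemID, c == null ? 1 : c + 1);
    }
    reader.close();

    // Pass 2: keep only the rows whose item passed the threshold.
    reader = new BufferedReader(new FileReader(in));
    PrintWriter writer = new PrintWriter(new FileWriter(out));
    while ((line = reader.readLine()) != null) {
      if (line.isEmpty()) continue;
      long itemID = Long.parseLong(line.split(",")[1]);
      if (counts.get(itemID) >= minUsersPerItem) {
        writer.println(line);
      }
    }
    writer.close();
    reader.close();
  }
}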
