I just tested the app with Mahout 0.6. There seems to be a small performance improvement, but recommendations for the 'heavy' users still take between 1 and 5 seconds.

On Wed, Nov 30, 2011 at 4:50 PM, Daniel Zohar <[email protected]> wrote:

> I will now try using the latest snapshot from
> http://svn.apache.org/repos/asf/mahout/trunk .
>
> I would really prefer to avoid pre-computing the item similarities at the
> moment. Do you believe I can achieve good performance without it?
>
> Is there any specific pruning method you would recommend? I guess this is
> only relevant at recommendation time, as all the data is needed in order
> for all of my users to be able to get recommendations.
>
> Lastly, as I wrote before, if a user has chosen in the past an item which
> many other users have chosen as well (we have a few items with 100k-400k
> associated users), the recommendation is significantly slower (unless he
> chose only very few items). Maybe that's a hint about the bottleneck in
> the similarity computations?
>
> On Wed, Nov 30, 2011 at 4:24 PM, Sean Owen <[email protected]> wrote:
>
>> Yeah, I agree that using just a handful of candidates is far too few, and
>> that's not a solution. It should not be so slow even with a reasonable
>> number of prefs and users.
>>
>> Multi-threading *is* a problem insofar as there is no multi-threading
>> helping speed up your request. But that's a side issue.
>>
>> Definitely use the version in Subversion instead of 0.5; I think it will
>> help directly. 5000 item-item similarities shouldn't take that long to
>> compute.
>>
>> You can try pre-computing similarities as you mentioned earlier, if you
>> can find more RAM.
>>
>> Of course, I always also recommend looking at pruning; often 90% of your
>> data says very little and 10% carries most of the information. Figuring
>> out which is which is hard! But it's often possible to drastically
>> simplify things by finding some clever way of deciding what's not useful.
>>
>> And of course you can always start to look at Hadoop-based recommenders.
>>
>> On Wed, Nov 30, 2011 at 2:19 PM, Daniel Zohar <[email protected]> wrote:
>>
>>> Hi Sean,
>>> First of all, let me thank you for all your help thus far :)
>>>
>>> I am using Mahout 0.5.
>>> At the moment the application is not live yet, so I assume
>>> multi-threading is not a problem for now.
>>>
>>> I definitely see that the bottleneck is in the similarity computations.
>>> Looking at TopItems.getTopItems, I can see that the method iterates over
>>> all the 'possible items' and evaluates them using the Estimator, which
>>> in turn iterates over all the past user choices for every possible item.
>>> Now let's assume a user chose 50 items before and has 100 possible
>>> items; that's already 5k item-item similarities to calculate. If I
>>> didn't cap the possible items, it could wind up at much larger numbers.
>>>
>>> I would also like to add that although the solution I posted before
>>> improves performance, it severely degrades the quality of the
>>> recommendations, as it checks a smaller pool of possible items.
>>>
>>> Thanks!
>>>
>>> On Wed, Nov 30, 2011 at 3:47 PM, Sean Owen <[email protected]> wrote:
>>>
>>>> I have a few more thoughts.
>>>>
>>>> First, I was wrong about what the first parameter to
>>>> SamplingCandidateItemsStrategy means. It's effectively a minimum,
>>>> rather than a maximum; setting it to 1 just means it will sample at
>>>> least 1 pref. I think you figured that out. I think values like (5, 1)
>>>> are probably about right for you.
>>>>
>>>> I see that your change is to further impose a global cap on the number
>>>> of candidate items returned. I understand the logic of that -- Sebastian,
>>>> what do you think? (PS: you can probably make that run slightly faster
>>>> by using LongPrimitiveIterator instead of Iterator<Long>.)
>>>>
>>>> Something still feels a bit off here; that's a very long time. Your JVM
>>>> params are impeccable, and you have a good amount of RAM and a strong
>>>> machine.
>>>>
>>>> Since you're getting speed-up by directly reducing the number of
>>>> candidate items, I get the idea that your similarity computations are
>>>> the bottleneck. Does any of your profiling confirm that?
>>>>
>>>> Are you using the latest code? I can think of one change in the last
>>>> few months that I added (certainly since 0.5) that would speed up
>>>> LogLikelihoodSimilarity a fair bit. I know yours is 'boolean' data, so
>>>> this ought to be very fast.
>>>>
>>>> I'll also say that the computation here is not multi-threaded. I had
>>>> always sort of thought that, at scale, you'd be getting parallelism
>>>> from handling multiple concurrent requests. It would be possible to
>>>> rewrite a lot of the internals to compute top recs using multiple
>>>> threads. That might make individual requests return faster on a
>>>> multi-core machine, though it wouldn't increase overall throughput.
>>>>
>>>> On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <[email protected]> wrote:
>>>>
>>>>> Hello all,
>>>>> This email follows the correspondence on Stack Overflow between myself
>>>>> and Sean Owen. Please see
>>>>> http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues
>>>>>
>>>>> I'm building a boolean-based recommendation engine with the following
>>>>> data:
>>>>>
>>>>> - 12M users
>>>>> - 2M items
>>>>> - 18M user-item (boolean) choices
>>>>>
>>>>> The following code is used to build the recommender:
>>>>>
>>>>> DataModel dataModel = new FileDataModel(new File(dataFile));
>>>>> ItemSimilarity itemSimilarity = new CachingItemSimilarity(new
>>>>>     LogLikelihoodSimilarity(dataModel), dataModel);
>>>>> CandidateItemsStrategy candidateItemsStrategy = new
>>>>>     SamplingCandidateItemsStrategy(20, 5);
>>>>> MostSimilarItemsCandidateItemsStrategy
>>>>>     mostSimilarItemsCandidateItemsStrategy = new
>>>>>     SamplingCandidateItemsStrategy(20, 5);
>>>>>
>>>>> this.recommender = new GenericBooleanPrefItemBasedRecommender(dataModel,
>>>>>     itemSimilarity, candidateItemsStrategy,
>>>>>     mostSimilarItemsCandidateItemsStrategy);
>>>>>
>>>>> My app runs on Tomcat with the following JVM arguments:
>>>>> *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC
>>>>> -XX:+UseParallelOldGC*
>>>>>
>>>>> Recommendations with the code above work very well for users who have
>>>>> made 1-2 choices in the past, but can take over a minute when a user
>>>>> has made tens of choices, especially if one of those choices is a very
>>>>> popular item (i.e. was chosen by many other users).
>>>>>
>>>>> Even when using the *SamplingCandidateItemsStrategy* with (1, 1)
>>>>> arguments, I still did not manage to achieve fast results.
>>>>>
>>>>> The only way I managed to get somewhat OK results (max recommendation
>>>>> time ~4 secs) was by rewriting the *SamplingCandidateItemsStrategy* in
>>>>> a way that *doGetCandidateItems* returns a limited number of items.
>>>>> Following is the doGetCandidateItems method as I rewrote it:
>>>>> http://pastebin.com/6n9C8Pw1
>>>>>
>>>>> **I think a good response time for recommendations should be less than
>>>>> a second (preferably less than 500 milliseconds).**
>>>>> How can I make Mahout perform better? I have a feeling some
>>>>> optimization is needed both in the *CandidateItemsStrategy* and the
>>>>> *Recommender* itself.
>>>>>
>>>>> Thanks in advance!
>>>>> Daniel
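
For concreteness, a minimal sketch of the wiring from the original post with
the (5, 1) sampling values suggested above. It assumes the same two-int
SamplingCandidateItemsStrategy constructor used in the original code; the
class and method names here are illustrative, not part of the thread:

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.recommender.SamplingCandidateItemsStrategy;
    import org.apache.mahout.cf.taste.impl.similarity.CachingItemSimilarity;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class RecommenderBuilder {
      public static ItemBasedRecommender build(String dataFile) throws Exception {
        DataModel dataModel = new FileDataModel(new File(dataFile));
        // Cache item-item similarities so repeated lookups are cheap
        ItemSimilarity similarity =
            new CachingItemSimilarity(new LogLikelihoodSimilarity(dataModel), dataModel);
        // (5, 1) per the suggestion above: sample preferences down
        // aggressively, but always consider at least one per user
        SamplingCandidateItemsStrategy strategy =
            new SamplingCandidateItemsStrategy(5, 1);
        // one instance can serve as both strategy arguments
        return new GenericBooleanPrefItemBasedRecommender(
            dataModel, similarity, strategy, strategy);
      }
    }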
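On the LongPrimitiveIterator tip above, a small sketch of what the swap looks
like; sumIDs is a hypothetical helper used only to show the iteration pattern.
FastIDSet's iterator() returns a LongPrimitiveIterator, whose nextLong()
yields primitive longs and so avoids boxing each ID into a Long:

    import org.apache.mahout.cf.taste.impl.common.FastIDSet;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

    public final class CandidateScan {
      static long sumIDs(FastIDSet candidates) {
        long sum = 0;
        // nextLong() avoids the per-element Long allocation that an
        // Iterator<Long> would incur on a large candidate set
        LongPrimitiveIterator it = candidates.iterator();
        while (it.hasNext()) {
          sum += it.nextLong();
        }
        return sum;
      }
    }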
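Finally, a rough sketch of the kind of global cap on candidate items described
above. This is not the code from the pastebin link; the class name, helper
name, and the exact way candidates are gathered are illustrative assumptions
about one way such a cap could work:

    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.common.FastIDSet;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.model.PreferenceArray;

    public final class CappedCandidateGatherer {
      static FastIDSet gatherCandidates(long[] preferredItemIDs,
                                        DataModel dataModel,
                                        int maxCandidates) throws TasteException {
        FastIDSet preferred = new FastIDSet(preferredItemIDs.length);
        for (long id : preferredItemIDs) {
          preferred.add(id);
        }
        FastIDSet candidates = new FastIDSet();
        for (long itemID : preferredItemIDs) {
          // users who also chose this item
          PreferenceArray prefs = dataModel.getPreferencesForItem(itemID);
          for (int i = 0; i < prefs.length(); i++) {
            // everything those co-choosing users chose becomes a candidate,
            // except items the target user already has
            FastIDSet theirItems = dataModel.getItemIDsFromUser(prefs.getUserID(i));
            LongPrimitiveIterator it = theirItems.iterator();
            while (it.hasNext()) {
              long candidate = it.nextLong();
              if (!preferred.contains(candidate)) {
                candidates.add(candidate);
                if (candidates.size() >= maxCandidates) {
                  return candidates; // global cap reached: stop early
                }
              }
            }
          }
        }
        return candidates;
      }
    }

The early return is what bounds the work for 'heavy' users whose items have
100k-400k associated users, at the cost of the recommendation quality loss
noted earlier in the thread.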
