I have a few more thoughts.

First, I was wrong about what the first parameter to
SamplingCandidateItemsStrategy means. It's effectively a minimum rather
than a maximum; setting it to 1 just means it will sample at least 1
preference. I think you figured that out. Values like (5,1) are probably
about right for you.
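To make the sampling idea concrete, here is a standalone sketch (not Mahout's actual implementation; the data structures and the `maxUsersPerItem` parameter name are my own, for illustration): for each item the user already prefers, look at a bounded sample of the other users who also preferred it, and collect those users' items as candidates.

```java
import java.util.*;

public class SamplingSketch {

    // Hypothetical in-memory data: item -> users who preferred it,
    // and user -> items they preferred.
    static Map<Long, List<Long>> usersByItem = new HashMap<>();
    static Map<Long, List<Long>> itemsByUser = new HashMap<>();

    /**
     * Collect candidate items for a user by examining at most
     * maxUsersPerItem co-preferring users per item the user prefers.
     */
    static Set<Long> candidateItems(long userID, int maxUsersPerItem) {
        Set<Long> candidates = new HashSet<>();
        for (long itemID : itemsByUser.getOrDefault(userID, List.of())) {
            List<Long> coUsers = usersByItem.getOrDefault(itemID, List.of());
            // Bound the work per item: take only the first maxUsersPerItem
            // co-users here; a real implementation would sample more carefully.
            int limit = Math.min(maxUsersPerItem, coUsers.size());
            for (long otherUser : coUsers.subList(0, limit)) {
                candidates.addAll(itemsByUser.getOrDefault(otherUser, List.of()));
            }
        }
        // Don't recommend items the user already has.
        candidates.removeAll(itemsByUser.getOrDefault(userID, List.of()));
        return candidates;
    }

    public static void main(String[] args) {
        itemsByUser.put(1L, List.of(10L));
        itemsByUser.put(2L, List.of(10L, 11L));
        usersByItem.put(10L, List.of(1L, 2L));
        System.out.println(candidateItems(1L, 5)); // item 11, via co-user 2
    }
}
```

Capping `maxUsersPerItem` is what keeps very popular items from dragging in huge candidate sets.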

I see that your change further imposes a global cap on the number of
candidate items returned. I understand the logic of that -- Sebastian, what
do you think? (PS: you can probably make that run slightly faster by using
LongPrimitiveIterator instead of Iterator<Long>.)
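To illustrate why that helps (this is a standalone sketch in the spirit of Mahout's LongPrimitiveIterator, not the class itself): an Iterator<Long> boxes every ID into a Long object, whereas a primitive-returning iterator avoids that per-element allocation entirely.

```java
public class PrimitiveIterationSketch {

    // A minimal primitive-long iterator: nextLong() returns a primitive,
    // so no Long object is allocated per element, unlike Iterator<Long>.
    interface LongIterator {
        boolean hasNext();
        long nextLong();
    }

    static LongIterator iterate(long[] values) {
        return new LongIterator() {
            private int i = 0;
            public boolean hasNext() { return i < values.length; }
            public long nextLong() { return values[i++]; }
        };
    }

    static long sum(long[] values) {
        long total = 0L;
        for (LongIterator it = iterate(values); it.hasNext(); ) {
            total += it.nextLong(); // no boxing on this hot path
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new long[] {1, 2, 3, 4})); // prints 10
    }
}
```

In a tight loop over millions of item IDs, skipping the boxing adds up.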


Something still feels a bit off here; that's a very long time. Your JVM
params are impeccable, and you have a good amount of RAM and a strong
machine.

Since you're getting a speed-up by directly reducing the number of
candidate items, I get the idea that your similarity computations are the
bottleneck. Does your profiling confirm that?

Are you using the latest code? I can think of one change I added in the
last few months (certainly since 0.5) that would speed up
LogLikelihoodSimilarity a fair bit. I know you're using 'boolean' data, so
this ought to be very fast.
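For reference, the statistic behind LogLikelihoodSimilarity is Dunning's log-likelihood ratio (G²) over the 2x2 co-occurrence table of two items. This is a standalone sketch of that statistic, not Mahout's code:

```java
public class LlrSketch {

    // x * ln(x), with the convention 0 * ln(0) = 0.
    static double xLogX(double x) {
        return x == 0.0 ? 0.0 : x * Math.log(x);
    }

    /**
     * Log-likelihood ratio (G^2) for a 2x2 contingency table:
     * k11 = users with both items, k12 = only item A,
     * k21 = only item B, k22 = neither.
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = xLogX(k11 + k12) + xLogX(k21 + k22);
        double colEntropy = xLogX(k11 + k21) + xLogX(k12 + k22);
        double matrixEntropy = xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22);
        double total = xLogX(k11 + k12 + k21 + k22);
        return 2.0 * (total + matrixEntropy - rowEntropy - colEntropy);
    }

    public static void main(String[] args) {
        // Strong co-occurrence between two items among 10,000 users:
        // a large positive score.
        System.out.println(logLikelihoodRatio(100, 50, 50, 9800));
        // Independent items: score near 0.
        System.out.println(logLikelihoodRatio(10, 10, 10, 10));
    }
}
```

The score is near 0 when the two items co-occur at chance rates and grows as the co-occurrence becomes more surprising; since only counts and a few log calls are involved, it is indeed cheap for boolean data.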


I'll also say that the computation here is not multi-threaded. I had always
sort of thought that, at scale, you'd be getting parallelism from handling
multiple concurrent requests. It would be possible to rewrite a lot of the
internals to compute top recs using multiple threads. That might make
individual requests return faster on a multi-core machine, though it
wouldn't increase overall throughput.
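A sketch of what such a rewrite could look like (hypothetical, not existing Mahout code; the `scores` map stands in for per-item preference estimation): partition the candidate items across an ExecutorService, score each partition independently, then merge the partial results into a single top-n list.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelTopNSketch {

    /**
     * Score candidate items in parallel and return the n highest-scoring
     * item IDs. The scores map is a stand-in for a real per-item
     * preference estimator.
     */
    static List<Long> topN(List<Long> candidates, Map<Long, Double> scores,
                           int n, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Split candidates into roughly equal chunks, one task per chunk.
            int chunk = (candidates.size() + threads - 1) / threads;
            List<Future<List<Map.Entry<Long, Double>>>> futures = new ArrayList<>();
            for (int start = 0; start < candidates.size(); start += chunk) {
                List<Long> part =
                    candidates.subList(start, Math.min(start + chunk, candidates.size()));
                futures.add(pool.submit(() -> {
                    List<Map.Entry<Long, Double>> scored = new ArrayList<>();
                    for (long itemID : part) {
                        scored.add(Map.entry(itemID, scores.getOrDefault(itemID, 0.0)));
                    }
                    return scored;
                }));
            }
            // Merge the partial results and keep the n highest-scoring items.
            List<Map.Entry<Long, Double>> all = new ArrayList<>();
            for (Future<List<Map.Entry<Long, Double>>> f : futures) {
                all.addAll(f.get());
            }
            all.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
            List<Long> top = new ArrayList<>();
            for (int i = 0; i < Math.min(n, all.size()); i++) {
                top.add(all.get(i).getKey());
            }
            return top;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<Long, Double> scores = Map.of(1L, 0.2, 2L, 0.9, 3L, 0.5);
        System.out.println(topN(List.of(1L, 2L, 3L), scores, 2, 2)); // [2, 3]
    }
}
```

As noted above, this trades latency for throughput: each request uses several cores, so single-request time drops but the machine serves no more requests overall.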


On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <[email protected]> wrote:

> Hello all,
> This email follows the correspondence in StackExchange between myself and
> Sean Owen. Please see
> http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues
>
> I'm building a boolean-based recommendation engine with the following data:
>
>   - 12M users
>   - 2M items
>   - 18M user-item (boolean) choices
>
> The following code is used to build the recommender:
>
> DataModel dataModel = new FileDataModel(new File(dataFile));
> ItemSimilarity itemSimilarity = new CachingItemSimilarity(
>     new LogLikelihoodSimilarity(dataModel), dataModel);
> CandidateItemsStrategy candidateItemsStrategy =
>     new SamplingCandidateItemsStrategy(20, 5);
> MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy =
>     new SamplingCandidateItemsStrategy(20, 5);
>
> this.recommender = new GenericBooleanPrefItemBasedRecommender(dataModel,
>     itemSimilarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);
>
> My app runs on Tomcat with the following JVM arguments:
> *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC
> -XX:+UseParallelOldGC*
>
> Recommendations with the code above work very well for users who have made
> 1-2 choices in the past, but can take over a minute when a user has made
> tens of choices, especially if one of those choices is a very popular item
> (i.e. one chosen by many other users).
>
> Even when using *SamplingCandidateItemsStrategy* with (1,1) arguments, I
> still did not manage to achieve fast results.
>
> The only way I managed to get somewhat OK results (max recommendation time
> ~4 secs) was by rewriting *SamplingCandidateItemsStrategy* so that
> *doGetCandidateItems* returns a limited number of items. Here is the
> doGetCandidateItems method as I rewrote it:
> http://pastebin.com/6n9C8Pw1
>
> *I think a good response time for recommendations should be less than a
> second (preferably less than 500 milliseconds).*
> How can I make Mahout perform better? I have a feeling some optimization is
> needed both on the *CandidateItemsStrategy* and the *Recommender* itself.
> Thanks in advance!
> Daniel
>