I have a few more thoughts. First, I was wrong about what the first parameter to SamplingCandidateItemsStrategy means. It's effectively a minimum rather than a maximum; setting it to 1 just means it will sample at least 1 pref. I think you figured that out. Values like (5,1) are probably about right for you.
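To make that concrete, here is a tiny stand-alone sketch of sampling with a floor. It only illustrates the "effectively a minimum" behavior described above; it is not Mahout's actual implementation, and the method name and the fraction parameter are invented for the example:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SamplingSketch {

  // Hypothetical illustration: sample a fraction of a user's preferences,
  // but never fewer than minPrefs of them. A small first argument lowers
  // the floor but never turns sampling off entirely.
  static List<Long> samplePrefs(List<Long> prefs, int minPrefs,
                                double fraction, Random rng) {
    int target = Math.max(minPrefs, (int) (prefs.size() * fraction));
    target = Math.min(target, prefs.size());
    List<Long> copy = new ArrayList<>(prefs);
    Collections.shuffle(copy, rng);
    return copy.subList(0, target);
  }

  public static void main(String[] args) {
    List<Long> prefs = List.of(1L, 2L, 3L, 4L, 5L);
    // Even with a tiny fraction, at least 1 preference is sampled:
    System.out.println(samplePrefs(prefs, 1, 0.01, new Random(42)).size()); // prints 1
  }
}
```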
I see that your change is to further impose a global cap on the number of candidate items returned. I understand the logic of that -- Sebastian, what do you think? (PS: you can probably make that run slightly faster by using LongPrimitiveIterator instead of Iterator<Long>.)

Something still feels a bit off here; that's a very long time. Your JVM params are impeccable, and you have a good amount of RAM and a strong machine. Since you're getting a speed-up by directly reducing the number of candidate items, I get the idea that your similarity computations are the bottleneck. Does any of your profiling confirm that? Are you using the latest code? I can think of one change I added in the last few months (certainly since 0.5) that would speed up LogLikelihoodSimilarity a fair bit. I know you have 'boolean' data, so this ought to be very fast.

I'll also say that the computation here is not multi-threaded. I had always sort of assumed that, at scale, you'd get parallelism from handling multiple concurrent requests. It would be possible to rewrite a lot of the internals to compute top recs using multiple threads. That might make individual requests return faster on a multi-core machine, though it wouldn't increase overall throughput.

On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <[email protected]> wrote:

> Hello all,
> This email follows the correspondence on StackExchange between myself and
> Sean Owen.
> Please see
> http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues
>
> I'm building a boolean-based recommendation engine with the following data:
>
> - 12M users
> - 2M items
> - 18M user-item (boolean) choices
>
> The following code is used to build the recommender:
>
> DataModel dataModel = new FileDataModel(new File(dataFile));
> ItemSimilarity itemSimilarity = new CachingItemSimilarity(
>     new LogLikelihoodSimilarity(dataModel), dataModel);
> CandidateItemsStrategy candidateItemsStrategy =
>     new SamplingCandidateItemsStrategy(20, 5);
> MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy =
>     new SamplingCandidateItemsStrategy(20, 5);
>
> this.recommender = new GenericBooleanPrefItemBasedRecommender(dataModel,
>     itemSimilarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);
>
> My app runs on Tomcat with the following JVM arguments:
> *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC -XX:+UseParallelOldGC*
>
> Recommendations with the code above work very well for users who have made
> 1-2 choices in the past, but can take over a minute when a user has made
> tens of choices, especially if one of those choices is a very popular item
> (i.e. was chosen by many other users).
>
> Even when using the *SamplingCandidateItemsStrategy* with (1,1) arguments,
> I still did not manage to achieve fast results.
>
> The only way I managed to get somewhat OK results (max recommendation time
> ~4 secs) was by rewriting the *SamplingCandidateItemsStrategy* so that
> *doGetCandidateItems* returns a limited number of items. Here is the
> doGetCandidateItems method as I rewrote it:
> http://pastebin.com/6n9C8Pw1
>
> *I think a good response time for recommendations should be less than a
> second (preferably less than 500 milliseconds).*
> How can I make Mahout perform better?
> I have a feeling some optimization is needed both on the
> *CandidateItemsStrategy* and the *Recommender* itself.
>
> Thanks in advance!
> Daniel
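PS: for anyone following the thread, the "global cap" idea amounts to something like the following simplified, stand-alone sketch. This is not the pastebin code; the names and cap value are hypothetical, and in Mahout proper you would iterate a FastIDSet with LongPrimitiveIterator rather than box each long:

```java
import java.util.HashSet;
import java.util.Set;

public class CappedCandidates {

  // Simplified illustration of capping the total number of candidate items
  // returned, regardless of how many the per-item sampling produces.
  // Iterating a primitive long[] here avoids the autoboxing that an
  // Iterator<Long> would incur on every element.
  static Set<Long> capCandidates(long[] candidates, int maxCandidates) {
    Set<Long> result = new HashSet<>();
    for (long itemID : candidates) {
      if (result.size() >= maxCandidates) {
        break; // global cap reached; ignore remaining candidates
      }
      result.add(itemID);
    }
    return result;
  }

  public static void main(String[] args) {
    long[] ids = {10, 20, 30, 40, 50};
    System.out.println(capCandidates(ids, 3).size()); // prints 3
  }
}
```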
