Yeah, I agree that using just a handful of candidates is far too few and that's not a solution. It should not be so slow even with a reasonable number of prefs and users.
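(To make the numbers concrete: the configuration I'd try is below, using
the same two-int SamplingCandidateItemsStrategy constructor as in your
snippet. Treat it as a sketch; I'm writing the package names from
memory.)

import org.apache.mahout.cf.taste.impl.recommender.SamplingCandidateItemsStrategy;
import org.apache.mahout.cf.taste.recommender.CandidateItemsStrategy;
import org.apache.mahout.cf.taste.recommender.MostSimilarItemsCandidateItemsStrategy;

// (5, 1): the first argument acts as a minimum number of prefs sampled
// per item rather than a hard cap -- see the correction in my mail
// quoted below.
CandidateItemsStrategy candidates =
    new SamplingCandidateItemsStrategy(5, 1);
// The same class serves as the most-similar-items strategy too:
MostSimilarItemsCandidateItemsStrategy mostSimilar =
    new SamplingCandidateItemsStrategy(5, 1);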
Multi-threading *is* a problem insofar as there is no multi-threading
helping speed up your request. But that's a side issue. Definitely use
the version in Subversion instead of 0.5; I think it will help directly.
5000 item-item similarities shouldn't take that long to compute.

You can try pre-computing similarities as you mentioned earlier, if you
can find more RAM. There's a sketch of what I mean at the end of this
reply, just above the quoted thread.

Of course, I always also recommend looking at pruning; often 90% of your
data says very little and 10% carries most of the information. Figuring
out which is which is hard! But it's often possible to drastically
simplify things by finding some clever way of deciding what's not
useful. (A rough pruning sketch is at the very bottom of this mail.)

And of course you can always start to look at Hadoop-based recommenders.
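By pre-computing I mean something along these lines -- a minimal sketch,
assuming the GenericItemSimilarity constructor that takes
(otherSimilarity, dataModel, maxToKeep) and the dataModel from your
snippet; the maxToKeep figure is made up:

import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

// Precompute item-item similarities once at startup instead of computing
// them lazily per request. Only the strongest maxToKeep similarities are
// retained, which bounds memory.
int maxToKeep = 1000000; // hypothetical figure -- tune to your RAM
ItemSimilarity precomputed = new GenericItemSimilarity(
    new LogLikelihoodSimilarity(dataModel), dataModel, maxToKeep);

Be warned that with 2M items this walks a very large number of item
pairs up front, so realistically it only becomes practical after
pruning.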
On Wed, Nov 30, 2011 at 2:19 PM, Daniel Zohar <[email protected]> wrote:

> Hi Sean,
> First of all, let me thank you for all your help thus far :)
>
> I am using Mahout 0.5.
> At the moment the application is not live yet, so I assume
> multi-threading is not a problem at the moment.
>
> I definitely see that the bottleneck is in the similarity computations.
> Looking at TopItems.getTopItems, I can see that the method iterates
> over all the 'possible items' and evaluates them using the Estimator,
> which in turn iterates over all the past user choices for every
> possible item.
> Now let's assume a user chose 50 items before and has 100 possible
> items; that's already 5k item-item similarities to calculate. If I
> didn't cap the possible items, it could wind up at much larger numbers.
>
> I would also like to add that although the solution I posted before
> improves performance, it severely damages the quality of the
> recommendations, as it checks a smaller pool of possible items.
>
> Thanks!
>
> On Wed, Nov 30, 2011 at 3:47 PM, Sean Owen <[email protected]> wrote:
>
> > I have a few more thoughts.
> >
> > First, I was wrong about what the first parameter to
> > SamplingCandidateItemsStrategy means. It's effectively a minimum,
> > rather than a maximum; setting it to 1 just means it will sample at
> > least 1 pref. I think you figured that out. I think values like
> > (5, 1) are probably about right for you.
> >
> > I see that your change is to further impose a global cap on the
> > number of candidate items returned. I understand the logic of that --
> > Sebastian, what do you think? (PS: you can probably make that run
> > slightly faster by using LongPrimitiveIterator instead of
> > Iterator<Long>.)
> >
> > Something still feels a bit off here; that's a very long time. Your
> > JVM params are impeccable, and you have a good amount of RAM and a
> > strong machine.
> >
> > Since you're getting a speed-up by directly reducing the number of
> > candidate items, I get the idea that your similarity computations are
> > the bottleneck. Does any of your profiling confirm that?
> >
> > Are you using the latest code? I can think of one change in the last
> > few months that I added (certainly since 0.5) that would speed up
> > LogLikelihoodSimilarity a fair bit. I know you're using 'boolean'
> > data, so this ought to be very fast.
> >
> > I'll also say that the computation here is not multi-threaded. I had
> > always sort of thought that, at scale, you'd be getting parallelism
> > from handling multiple concurrent requests. It would be possible to
> > rewrite a lot of the internals to compute top recs using multiple
> > threads. That might make individual requests return faster on a
> > multi-core machine, though it wouldn't increase overall throughput.
> >
> > On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <[email protected]> wrote:
> >
> > > Hello all,
> > > This email follows the correspondence on StackExchange between
> > > myself and Sean Owen. Please see
> > > http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues
> > >
> > > I'm building a boolean-based recommendation engine with the
> > > following data:
> > >
> > > - 12M users
> > > - 2M items
> > > - 18M user-item (boolean) choices
> > >
> > > The following code is used to build the recommender:
> > >
> > > DataModel dataModel = new FileDataModel(new File(dataFile));
> > > ItemSimilarity itemSimilarity = new CachingItemSimilarity(
> > >     new LogLikelihoodSimilarity(dataModel), dataModel);
> > > CandidateItemsStrategy candidateItemsStrategy =
> > >     new SamplingCandidateItemsStrategy(20, 5);
> > > MostSimilarItemsCandidateItemsStrategy
> > >     mostSimilarItemsCandidateItemsStrategy =
> > >         new SamplingCandidateItemsStrategy(20, 5);
> > >
> > > this.recommender = new GenericBooleanPrefItemBasedRecommender(
> > >     dataModel, itemSimilarity, candidateItemsStrategy,
> > >     mostSimilarItemsCandidateItemsStrategy);
> > >
> > > My app runs on Tomcat with the following JVM arguments:
> > > *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC
> > > -XX:+UseParallelOldGC*
> > >
> > > Recommendations with the code above work very well for users who
> > > have made 1-2 choices in the past, but can take over a minute when
> > > a user has made tens of choices, especially if one of those choices
> > > is a very popular item (i.e. one chosen by many other users).
> > >
> > > Even when using the *SamplingCandidateItemsStrategy* with (1, 1)
> > > arguments, I still did not manage to achieve fast results.
> > >
> > > The only way I managed to get somewhat OK results (max
> > > recommendation time ~4 secs) was by rewriting the
> > > *SamplingCandidateItemsStrategy* so that *doGetCandidateItems*
> > > returns a limited number of items. Following is the
> > > doGetCandidateItems method as I rewrote it:
> > > http://pastebin.com/6n9C8Pw1
> > >
> > > **I think a good response time for recommendations should be less
> > > than a second (preferably less than 500 milliseconds).**
> > > How can I make Mahout perform better? I have a feeling some
> > > optimization is needed both in the *CandidateItemsStrategy* and in
> > > the *Recommender* itself.
> > >
> > > Thanks in advance!
> > > Daniel
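PS: on pruning, here's the flavor of thing I mean -- a rough sketch
only. It assumes the boolean FileDataModel format (one "userID,itemID"
line per pref), and the 100000 cutoff is a made-up number you'd have to
tune against recommendation quality:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Two-pass prune: count how many users chose each item, then drop prefs
// for items chosen by more than MAX_USERS_PER_ITEM users.
public class PruneDataFile {

  private static final int MAX_USERS_PER_ITEM = 100000; // hypothetical cutoff

  public static void main(String[] args) throws IOException {
    File input = new File(args[0]);

    // Pass 1: count users per item
    Map<String,Integer> counts = new HashMap<String,Integer>();
    BufferedReader reader = new BufferedReader(new FileReader(input));
    String line;
    while ((line = reader.readLine()) != null) {
      String itemID = line.substring(line.indexOf(',') + 1).trim();
      Integer count = counts.get(itemID);
      counts.put(itemID, count == null ? 1 : count + 1);
    }
    reader.close();

    // Pass 2: copy through only prefs for sufficiently rare items
    BufferedWriter writer = new BufferedWriter(new FileWriter(args[1]));
    reader = new BufferedReader(new FileReader(input));
    while ((line = reader.readLine()) != null) {
      String itemID = line.substring(line.indexOf(',') + 1).trim();
      if (counts.get(itemID) <= MAX_USERS_PER_ITEM) {
        writer.write(line);
        writer.newLine();
      }
    }
    reader.close();
    writer.close();
  }
}

Dropping the most popular items is just one heuristic, but prefs on
items that nearly everyone has chosen distinguish users the least, and
they're exactly the ones that blow up your candidate sets.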

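PPS: to illustrate the LongPrimitiveIterator aside in my earlier mail
quoted above -- the point is just to avoid boxing each long into a Long
on a hot path:

import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

FastIDSet possibleItemIDs = new FastIDSet();
possibleItemIDs.add(123L);
possibleItemIDs.add(456L);

// FastIDSet.iterator() returns a LongPrimitiveIterator; nextLong() hands
// back a primitive long, so no Long objects are created per element.
LongPrimitiveIterator it = possibleItemIDs.iterator();
while (it.hasNext()) {
  long itemID = it.nextLong();
  // ... estimate a preference for itemID ...
}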