I just tested the app with Mahout 0.6. There seems to be a small performance improvement, but recommendations for the 'heavy' users still take between 1 and 5 seconds.

On Wed, Nov 30, 2011 at 4:50 PM, Daniel Zohar <[email protected]> wrote:

> I will now try using the latest snapshot from
> http://svn.apache.org/repos/asf/mahout/trunk .
>
> I would really prefer to avoid pre-computing the item similarities at the
> moment. Do you believe I can achieve good performance without it?
>
> Is there any specific pruning method you would recommend? I guess this is
> only relevant at recommendation time, as all the data is needed in order
> for all of my users to be able to get recommendations.
>
> Lastly, as I wrote before, if a user has chosen in the past an item which
> many other users have chosen as well (we have a few items with 100k-400k
> associated users), the recommendation is significantly slower (unless he
> chose only very few items). Maybe that's a hint about the bottleneck in
> the similarity computations?
>
> On Wed, Nov 30, 2011 at 4:24 PM, Sean Owen <[email protected]> wrote:
>
>> Yeah, I agree that using just a handful of candidates is far too few, and
>> that's not a solution. It should not be so slow even with a reasonable
>> number of prefs and users.
>>
>> Multi-threading *is* a problem insofar as there is no multi-threading
>> helping speed up your request. But that's a side issue.
>>
>> Definitely use the version in Subversion instead of 0.5; I think it will
>> help directly. 5000 item-item similarities shouldn't take that long to
>> compute.
>>
>> You can try pre-computing similarities as you mentioned earlier, if you
>> can find more RAM.
>>
>> Of course, I always also recommend looking at pruning; often 90% of your
>> data says very little and 10% carries most of the information. Figuring
>> out which is which is hard! But it's often possible to drastically
>> simplify things by finding some clever way of deciding what's not useful.
>>
>> And of course you can always start to look at Hadoop-based recommenders.
>>
>> On Wed, Nov 30, 2011 at 2:19 PM, Daniel Zohar <[email protected]> wrote:
>>
>>> Hi Sean,
>>> First of all, let me thank you for all your help thus far :)
>>>
>>> I am using Mahout 0.5.
>>> At the moment the application is not live yet, so I assume
>>> multi-threading is not a problem for now.
>>>
>>> I definitely see that the bottleneck is in the similarity computations.
>>> Looking at TopItems.getTopItems, I can see that the method iterates over
>>> all the 'possible items' and evaluates them using the Estimator, which
>>> in turn iterates over all the past user choices for every possible item.
>>> Now let's assume a user chose 50 items before and has 100 possible
>>> items; that's already 5k item-item similarities to calculate. If I
>>> didn't cap the possible items, it could wind up at much larger numbers.
>>>
>>> I would also like to add that although the solution I posted before
>>> improves performance, it severely degrades the quality of the
>>> recommendations, as it checks a smaller pool of possible items.
>>>
>>> Thanks!
>>>
>>> On Wed, Nov 30, 2011 at 3:47 PM, Sean Owen <[email protected]> wrote:
>>>
>>>> I have a few more thoughts.
>>>>
>>>> First, I was wrong about what the first parameter to
>>>> SamplingCandidateItemsStrategy means. It's effectively a minimum,
>>>> rather than a maximum; setting it to 1 just means it will sample at
>>>> least 1 pref. I think you figured that out. I think values like (5, 1)
>>>> are probably about right for you.
>>>>
>>>> I see that your change is to further impose a global cap on the number
>>>> of candidate items returned. I understand the logic of that -- Sebastian,
>>>> what do you think? (PS: you can probably make that run slightly faster
>>>> by using LongPrimitiveIterator instead of Iterator<Long>.)
>>>>
>>>> Something still feels a bit off here; that's a very long time. Your JVM
>>>> params are impeccable, and you have a good amount of RAM and a strong
>>>> machine.
>>>>
>>>> Since you're getting speed-up by directly reducing the number of
>>>> candidate items, I get the idea that your similarity computations are
>>>> the bottleneck. Does any of your profiling confirm that?
>>>>
>>>> Are you using the latest code? I can think of one change in the last
>>>> few months that I added (certainly since 0.5) that would speed up
>>>> LogLikelihoodSimilarity a fair bit. I know yours is 'boolean' data, so
>>>> this ought to be very fast.
>>>>
>>>> I'll also say that the computation here is not multi-threaded. I had
>>>> always sort of thought that, at scale, you'd be getting parallelism
>>>> from handling multiple concurrent requests. It would be possible to
>>>> rewrite a lot of the internals to compute top recs using multiple
>>>> threads. That might make individual requests return faster on a
>>>> multi-core machine, though it wouldn't increase overall throughput.
>>>>
>>>> On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <[email protected]> wrote:
>>>>
>>>>> Hello all,
>>>>> This email follows the correspondence on Stack Overflow between myself
>>>>> and Sean Owen. Please see
>>>>> http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues
>>>>>
>>>>> I'm building a boolean-based recommendation engine with the following
>>>>> data:
>>>>>
>>>>> - 12M users
>>>>> - 2M items
>>>>> - 18M user-item (boolean) choices
>>>>>
>>>>> The following code is used to build the recommender:
>>>>>
>>>>> DataModel dataModel = new FileDataModel(new File(dataFile));
>>>>> ItemSimilarity itemSimilarity = new CachingItemSimilarity(new
>>>>>     LogLikelihoodSimilarity(dataModel), dataModel);
>>>>> CandidateItemsStrategy candidateItemsStrategy = new
>>>>>     SamplingCandidateItemsStrategy(20, 5);
>>>>> MostSimilarItemsCandidateItemsStrategy
>>>>>     mostSimilarItemsCandidateItemsStrategy = new
>>>>>     SamplingCandidateItemsStrategy(20, 5);
>>>>>
>>>>> this.recommender = new GenericBooleanPrefItemBasedRecommender(dataModel,
>>>>>     itemSimilarity, candidateItemsStrategy,
>>>>>     mostSimilarItemsCandidateItemsStrategy);
>>>>>
>>>>> My app runs on Tomcat with the following JVM arguments:
>>>>> *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC
>>>>> -XX:+UseParallelOldGC*
>>>>>
>>>>> Recommendations with the code above work very well for users who have
>>>>> made 1-2 choices in the past, but can take over a minute when a user
>>>>> has made tens of choices, especially if one of those choices is a very
>>>>> popular item (i.e. was chosen by many other users).
>>>>>
>>>>> Even when using the *SamplingCandidateItemsStrategy* with (1, 1)
>>>>> arguments, I still did not manage to achieve fast results.
>>>>>
>>>>> The only way I managed to get somewhat OK results (max recommendation
>>>>> time ~4 secs) was by rewriting the *SamplingCandidateItemsStrategy* in
>>>>> a way that *doGetCandidateItems* returns a limited number of items.
>>>>> Following is the doGetCandidateItems method as I rewrote it:
>>>>> http://pastebin.com/6n9C8Pw1
>>>>>
>>>>> **I think a good response time for recommendations should be less than
>>>>> a second (preferably less than 500 milliseconds).**
>>>>> How can I make Mahout perform better? I have a feeling some
>>>>> optimization is needed both in the *CandidateItemsStrategy* and the
>>>>> *Recommender* itself.
>>>>>
>>>>> Thanks in advance!
>>>>> Daniel
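
For concreteness, a minimal sketch of the wiring from the original post with
the (5, 1) sampling values suggested above. It assumes the same two-int
SamplingCandidateItemsStrategy constructor used in the original code; the
class and method names here are illustrative, not part of the thread:

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.recommender.SamplingCandidateItemsStrategy;
    import org.apache.mahout.cf.taste.impl.similarity.CachingItemSimilarity;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class RecommenderBuilder {
      public static ItemBasedRecommender build(String dataFile) throws Exception {
        DataModel dataModel = new FileDataModel(new File(dataFile));
        // Cache item-item similarities so repeated lookups are cheap
        ItemSimilarity similarity =
            new CachingItemSimilarity(new LogLikelihoodSimilarity(dataModel), dataModel);
        // (5, 1) per the suggestion above: sample preferences down
        // aggressively, but always consider at least one per user
        SamplingCandidateItemsStrategy strategy =
            new SamplingCandidateItemsStrategy(5, 1);
        // one instance can serve as both strategy arguments
        return new GenericBooleanPrefItemBasedRecommender(
            dataModel, similarity, strategy, strategy);
      }
    }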
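On the LongPrimitiveIterator tip above, a small sketch of what the swap looks
like; sumIDs is a hypothetical helper used only to show the iteration pattern.
FastIDSet's iterator() returns a LongPrimitiveIterator, whose nextLong()
yields primitive longs and so avoids boxing each ID into a Long:

    import org.apache.mahout.cf.taste.impl.common.FastIDSet;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

    public final class CandidateScan {
      static long sumIDs(FastIDSet candidates) {
        long sum = 0;
        // nextLong() avoids the per-element Long allocation that an
        // Iterator<Long> would incur on a large candidate set
        LongPrimitiveIterator it = candidates.iterator();
        while (it.hasNext()) {
          sum += it.nextLong();
        }
        return sum;
      }
    }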
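Finally, a rough sketch of the kind of global cap on candidate items described
above. This is not the code from the pastebin link; the class name, helper
name, and the exact way candidates are gathered are illustrative assumptions
about one way such a cap could work:

    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.common.FastIDSet;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.model.PreferenceArray;

    public final class CappedCandidateGatherer {
      static FastIDSet gatherCandidates(long[] preferredItemIDs,
                                        DataModel dataModel,
                                        int maxCandidates) throws TasteException {
        FastIDSet preferred = new FastIDSet(preferredItemIDs.length);
        for (long id : preferredItemIDs) {
          preferred.add(id);
        }
        FastIDSet candidates = new FastIDSet();
        for (long itemID : preferredItemIDs) {
          // users who also chose this item
          PreferenceArray prefs = dataModel.getPreferencesForItem(itemID);
          for (int i = 0; i < prefs.length(); i++) {
            // everything those co-choosing users chose becomes a candidate,
            // except items the target user already has
            FastIDSet theirItems = dataModel.getItemIDsFromUser(prefs.getUserID(i));
            LongPrimitiveIterator it = theirItems.iterator();
            while (it.hasNext()) {
              long candidate = it.nextLong();
              if (!preferred.contains(candidate)) {
                candidates.add(candidate);
                if (candidates.size() >= maxCandidates) {
                  return candidates; // global cap reached: stop early
                }
              }
            }
          }
        }
        return candidates;
      }
    }

The early return is what bounds the work for 'heavy' users whose items have
100k-400k associated users, at the cost of the recommendation quality loss
noted earlier in the thread.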
