I will now try using the latest snapshot from http://svn.apache.org/repos/asf/mahout/trunk .

I would really prefer to avoid pre-computing the item similarities at the moment. Do you believe I can achieve good performance without it?
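(For reference, a minimal sketch of what pre-computing could look like, assuming Mahout's GenericItemSimilarity and its 0.5-era (similarity, dataModel, maxToKeep) constructor, which computes item-item similarities eagerly and keeps only the strongest pairs in memory. The maxToKeep value below is made up, and with 2M items a full precomputation is likely impractical without pruning first:)

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

DataModel dataModel = new FileDataModel(new File("prefs.csv")); // path is illustrative
// Pays the similarity cost once at startup; lookups at recommendation
// time become in-memory hits instead of log-likelihood computations.
int maxToKeep = 1000000; // hypothetical; bound this by available RAM
ItemSimilarity precomputed =
    new GenericItemSimilarity(new LogLikelihoodSimilarity(dataModel), dataModel, maxToKeep);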
Is there any specific pruning method you would recommend? I guess this is only relevant at recommendation time, as all the data is needed in order for all of my users to be able to get recommendations.

Lastly, as I wrote before, if a user chose an item in the past which many other users also chose (we have a few items with 100k-400k associated users), the recommendation is significantly slower (unless he chose only very few items). Maybe that's a hint about the bottleneck in the similarity computations?

On Wed, Nov 30, 2011 at 4:24 PM, Sean Owen <[email protected]> wrote:

> Yeah, I agree that using just a handful of candidates is far too few and that's not a solution. It should not be so slow even with a reasonable number of prefs and users.
>
> Multi-threading *is* a problem insofar as there is no multi-threading helping speed up your request. But that's a side issue.
>
> Definitely use the version in Subversion instead of 0.5; I think it will help directly. 5000 item-item similarities shouldn't take that long to compute.
>
> You can try pre-computing similarities as you mentioned earlier, if you can find more RAM.
>
> Of course, I always also recommend looking at pruning; often 90% of your data says very little and 10% carries most of the information. Figuring out which is which is hard! But it's often possible to drastically simplify things by finding some clever way of deciding what's not useful.
>
> And of course you can always start to look at Hadoop-based recommenders.
>
> On Wed, Nov 30, 2011 at 2:19 PM, Daniel Zohar <[email protected]> wrote:
>
> > Hi Sean,
> > First of all, let me thank you for all your help thus far :)
> >
> > I am using Mahout 0.5. At the moment the application is not live yet, so I assume multi-threading is not a problem for now.
> >
> > I definitely see that the bottleneck is in the similarity computations. Looking at TopItems.getTopItems, I can see that the method iterates over all the 'possible items' and evaluates them using the Estimator, which in turn iterates over all the past user choices for every possible item. Now let's assume a user chose 50 items before and has 100 possible items; that's already 5k item-item similarities to calculate. If I didn't cap the possible items, it could wind up at much larger numbers.
> >
> > I would also like to add that although the solution I posted before improves performance, it severely damages the quality of the recommendations, as it checks a smaller pool of possible items.
> >
> > Thanks!
> >
> > On Wed, Nov 30, 2011 at 3:47 PM, Sean Owen <[email protected]> wrote:
> >
> > > I have a few more thoughts.
> > >
> > > First, I was wrong about what the first parameter to SamplingCandidateItemsStrategy means. It's effectively a minimum, rather than a maximum; setting it to 1 just means it will sample at least 1 pref. I think you figured that out. I think values like (5,1) are probably about right for you.
> > >
> > > I see that your change is to further impose a global cap on the number of candidate items returned. I understand the logic of that -- Sebastian, what do you think? (PS: you can probably make that run slightly faster by using LongPrimitiveIterator instead of Iterator<Long>.)
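(A quick sketch of that PS, assuming the candidate set is Mahout's FastIDSet: its primitive iterator hands back raw longs, so a hot loop avoids boxing every ID into a Long:)

import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

FastIDSet possibleItemIDs = new FastIDSet(); // in practice, the candidates being capped

// Instead of iterating via Iterator<Long>, which allocates a Long per ID:
LongPrimitiveIterator it = possibleItemIDs.iterator();
while (it.hasNext()) {
  long itemID = it.nextLong(); // primitive long, no boxing
  // ... apply the global cap / filtering to itemID ...
}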
> > > Something still feels a bit off here; that's a very long time. Your JVM params are impeccable, and you have a good amount of RAM and a strong machine.
> > >
> > > Since you're getting speed-up by directly reducing the number of candidate items, I get the idea that your similarity computations are the bottleneck. Does any of your profiling confirm that?
> > >
> > > Are you using the latest code? I can think of one change in the last few months that I added (certainly since 0.5) that would speed up LogLikelihoodSimilarity a fair bit. I know yours is 'boolean' data, so this ought to be very fast.
> > >
> > > I'll also say that the computation here is not multi-threaded. I had always sort of thought that, at scale, you'd be getting parallelism from handling multiple concurrent requests. It would be possible to rewrite a lot of the internals to compute top recs using multiple threads. That might make individual requests return faster on a multi-core machine, though it wouldn't increase overall throughput.
> > >
> > > On Wed, Nov 30, 2011 at 9:11 AM, Daniel Zohar <[email protected]> wrote:
> > >
> > > > Hello all,
> > > > This email follows the correspondence on Stack Overflow between myself and Sean Owen. Please see http://stackoverflow.com/questions/8240383/apache-mahout-performance-issues
> > > >
> > > > I'm building a boolean-based recommendation engine with the following data:
> > > >
> > > > - 12M users
> > > > - 2M items
> > > > - 18M user-item (boolean) choices
> > > >
> > > > The following code is used to build the recommender:
> > > >
> > > > DataModel dataModel = new FileDataModel(new File(dataFile));
> > > > ItemSimilarity itemSimilarity = new CachingItemSimilarity(new LogLikelihoodSimilarity(dataModel), dataModel);
> > > > CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(20, 5);
> > > > MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(20, 5);
> > > > this.recommender = new GenericBooleanPrefItemBasedRecommender(dataModel, itemSimilarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);
> > > >
> > > > My app runs on Tomcat with the following JVM arguments: *-Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=19 -XX:+UseParallelGC -XX:+UseParallelOldGC*
> > > >
> > > > Recommendations with the code above work very well for users who have made 1-2 choices in the past, but can take over a minute when a user has made tens of choices, especially if one of those choices is a very popular item (i.e. was chosen by many other users).
> > > >
> > > > Even when using the *SamplingCandidateItemsStrategy* with (1,1) arguments, I still did not manage to achieve fast results.
> > > >
> > > > The only way I managed to get somewhat OK results (max recommendation time ~4 secs) was by rewriting the *SamplingCandidateItemsStrategy* so that *doGetCandidateItems* returns a limited number of items. Following is the doGetCandidateItems method as I re-wrote it: http://pastebin.com/6n9C8Pw1
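(In case the pastebin link rots, here is a minimal sketch of the same idea -- not the exact code posted -- assuming the 0.5-era doGetCandidateItems(long[], DataModel) contract of AbstractCandidateItemsStrategy; MAX_ITEMS is a made-up cap:)

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

private static final int MAX_ITEMS = 2000; // hypothetical global cap

@Override
protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel)
    throws TasteException {
  FastIDSet possibleItemIDs = new FastIDSet();
  for (long itemID : preferredItemIDs) {
    PreferenceArray prefs = dataModel.getPreferencesForItem(itemID);
    // Walk the users who also chose this item, but stop growing the candidate
    // set once the global cap is hit, so 100k-user items can't dominate it.
    for (int i = 0; i < prefs.length() && possibleItemIDs.size() < MAX_ITEMS; i++) {
      possibleItemIDs.addAll(dataModel.getItemIDsFromUser(prefs.getUserID(i)));
    }
    if (possibleItemIDs.size() >= MAX_ITEMS) {
      break;
    }
  }
  possibleItemIDs.removeAll(preferredItemIDs); // never recommend already-chosen items
  return possibleItemIDs;
}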
> > > >
> > > > **I think a good response time for recommendations should be less than a second (preferably less than 500 milliseconds).**
> > > >
> > > > How can I make Mahout perform better? I have a feeling some optimization is needed both in the *CandidateItemsStrategy* and the *Recommender* itself.
> > > >
> > > > Thanks in advance!
> > > > Daniel
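(On Sean's pruning suggestion above, one crude but concrete option is to trim the input file before FileDataModel ever loads it, dropping long-tail items chosen by fewer than a handful of users. A sketch only; file names and the threshold are made up:)

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public final class PrunePrefs {
  public static void main(String[] args) throws IOException {
    String in = "prefs.csv";        // hypothetical: one "userID,itemID" row per choice
    String out = "prefs-pruned.csv";
    int minUsersPerItem = 2;        // hypothetical threshold

    // Pass 1: count how many users chose each item.
    Map<Long, Integer> counts = new HashMap<Long, Integer>();
    BufferedReader reader = new BufferedReader(new FileReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      if (line.isEmpty()) continue;
      long itemID = Long.parseLong(line.split(",")[1]);
      Integer c = counts.get(itemID);
      counts.put(itemID, c == null ? 1 : c + 1);
    }
    reader.close();

    // Pass 2: keep only the rows whose item passed the threshold.
    reader = new BufferedReader(new FileReader(in));
    PrintWriter writer = new PrintWriter(new FileWriter(out));
    while ((line = reader.readLine()) != null) {
      if (line.isEmpty()) continue;
      long itemID = Long.parseLong(line.split(",")[1]);
      if (counts.get(itemID) >= minUsersPerItem) {
        writer.println(line);
      }
    }
    writer.close();
    reader.close();
  }
}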
