Hello,

On 01.12.2011, at 09:37, Sebastian Schelter wrote:

> Daniel, can you plot two curves showing the distribution of
> interactions per user and the distribution of interactions per item? I
> think we need to get a better picture of your data first.
>
> Generally I always recommend to use precomputed similarities. You can
> still serve new users with realtime recommendations, the only
> disadvantages are the higher complexity and a delayed inclusion of new
> items.

In this paper:

Fast Online Learning through Offline Initialization for Time-sensitive Recommendation
http://users.cs.fiu.edu/~lzhen001/activities/KDD_USB_key_2010/docs/p703.pdf

Deepak Agarwal et al. describe how to include new items quickly in the
recommendations. This is used for personalizing the news stories on the
Yahoo start page.

@Daniel: I would also recommend profiling your application with JVisualVM:
http://visualvm.java.net/

After I did this with my recommender, I found that the default cache size
for item similarities was far too low. The details are described in this
ticket:
https://issues.apache.org/jira/browse/MAHOUT-905

>
> --sebastian

/Manuel

>
> 2011/11/30 Sean Owen <[email protected]>:
>> The simple answer is that:
>>
>> Mahout absorbed a non-distributed recommender project called Taste, which
>> scales up to a point which may be sufficient for a lot of users. It
>> certainly is a lot simpler. Yes it is realistic to do near-real-time
>> recommendations, though it gets harder and harder and requires more tuning,
>> tradeoffs and optimization as this thread shows.
>>
>> The rest, written from scratch, is almost all distributed and Hadoop-based,
>> including distributed re-implementations of the same algorithms.
>>
>> On Wed, Nov 30, 2011 at 8:23 PM, Dan Beaulieu
>> <[email protected]> wrote:
>>
>>> Hi all, this is a tangent and can mostly be ignored by the people
>>> interested in this problem.
>>>
>>> I'm new to Machine Learning and especially Mahout. Following this
>>> discussion has made me a bit confused.
>>> Isn't Mahout used for large datasets where it makes sense to distribute the
>>> work? Why then isn't anyone pointing
>>> out that the problem may be the use of one single Mahout node? Is it
>>> because it's boolean based? Is it because the data set
>>> isn't really that large?
>>>
>>> Even if for whatever reason a single node will do for this case, is it
>>> really expected that the recommendation process would finish in less than
>>> half a second?
>>> This makes me think if that is the expectation then the data set is
>>> actually small and Mahout might be overkill...
>>>
>>> What obvious piece of the Mahout puzzle am I missing?
>>>
>>> Thanks.
>>>
>>> Dan
>>>
>>> On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen <[email protected]> wrote:
>>>
>>>> Have you used CachingItemSimilarity? That will hold common similarities in
>>>> memory. It's a lot easier than pre-computing and might help.
>>>>
>>>> I think something like your change is a good one (Sebastian what do you
>>>> think) in that it gives you the ultimate lever to control how many
>>>> candidates are evaluated. That ought to make it go as fast as you like, but
>>>> it trades off quality. Still I'd be really surprised if there's no viable
>>>> middle ground -- this works fine at smaller scale, where 100s of candidates
>>>> are evaluated, perhaps, and you can use your lever to get to 100s of
>>>> candidates at your scale too. Is that still both slow and inaccurate?
>>>>
>>>> On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <[email protected]> wrote:
>>>>
>>>>> I just tested the app with Mahout 0.6.
>>>>> There seems to be a small performance improvement, but still
>>>>> recommendations for the 'heavy users' take between 1-5 seconds.
>>>>>
>>>>>
>>>>
>>>

-- 
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B
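
P.S. For the two curves Sebastian asked for, here is a minimal sketch in Python of how the distributions could be computed before plotting. It assumes the interactions can be exported as (user_id, item_id) pairs. The function name and the sample data are made up for illustration and are not part of Mahout.

```python
from collections import Counter


def interaction_distributions(interactions):
    """Return (user_hist, item_hist) for an iterable of (user, item) pairs.

    Each histogram maps "number of interactions" to "how many users (or
    items) have exactly that many interactions".
    """
    # Boolean preference data: count each (user, item) pair only once.
    pairs = set(interactions)

    # Interactions per user and per item.
    per_user = Counter(user for user, _ in pairs)
    per_item = Counter(item for _, item in pairs)

    # Histogram of those counts; this is what would be plotted.
    user_hist = Counter(per_user.values())
    item_hist = Counter(per_item.values())
    return user_hist, item_hist


# Hypothetical sample data, only to show the shape of the result.
sample = [
    ("u1", "i1"), ("u1", "i2"), ("u1", "i3"),
    ("u2", "i1"),
    ("u3", "i1"), ("u3", "i2"),
]

user_hist, item_hist = interaction_distributions(sample)
for n in sorted(user_hist):
    print(f"{user_hist[n]} user(s) with {n} interaction(s)")
for n in sorted(item_hist):
    print(f"{item_hist[n]} item(s) with {n} interaction(s)")
```

On real data, a long tail in the per-user histogram would show how many 'heavy users' there are and how heavy they get, which is probably the first thing to check for the slow recommendations discussed in this thread.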
