No, it was CPU bound, not memory. I gave it something like a 14G heap. It was running, just too slow to be of any real use. We switched to the Hadoop version and stored precalculated recs in a db for every user.
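To make that batch-plus-lookup pattern concrete, here is a minimal sketch of the serving side of "precalculated recs in a db". The table, column, and class names are invented for illustration; only RecommendedItem and GenericRecommendedItem are real Mahout classes, and the batch job (e.g. Mahout's hadoop RecommenderJob) is assumed to have filled the table beforehand.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.cf.taste.impl.recommender.GenericRecommendedItem;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class PrecomputedRecStore {

  private final Connection connection; // assumed to be opened elsewhere

  public PrecomputedRecStore(Connection connection) {
    this.connection = connection;
  }

  /** No model in memory: serving is a single indexed lookup into batch output. */
  public List<RecommendedItem> recommend(long userID, int howMany) throws Exception {
    String sql = "SELECT item_id, score FROM precomputed_recs "
        + "WHERE user_id = ? ORDER BY score DESC LIMIT ?";
    List<RecommendedItem> recs = new ArrayList<RecommendedItem>();
    try (PreparedStatement stmt = connection.prepareStatement(sql)) {
      stmt.setLong(1, userID);
      stmt.setInt(2, howMany);
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          recs.add(new GenericRecommendedItem(rs.getLong(1), rs.getFloat(2)));
        }
      }
    }
    return recs;
  }
}

The trade is freshness for speed: recs are only as new as the last batch run, but online responses are constant-time and need no model in memory.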
On Jul 18, 2013, at 12:06 PM, Peng Cheng <pc...@uowmail.edu.au> wrote:

Strange, it's just a little bit larger than the libimseti dataset (17M ratings). Did you encounter an OutOfMemory or GCTimeOut exception? Allocating more heap space usually helps.

Yours Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote:
> It was about 2.5M users and 500K items with 25M actions over 6 months of data.
>
> On Jul 18, 2013, at 10:15 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:
>
> If I remember right, a highlight of the 0.8 release is an online clustering algorithm. I'm not sure if it can be used in an item-based recommender, but this is definitely something I would like to pursue. It's probably the only advantage a non-Hadoop implementation can offer in the future.
>
> Many non-Hadoop recommenders are pretty fast. But the existing in-memory GenericDataModel and FileDataModel are largely implemented for sandboxes; IMHO they are the culprit of the scalability problem.
>
> May I ask about the scale of your dataset? How many ratings does it have?
>
> Yours Peng
>
> On 13-07-18 12:14 PM, Sebastian Schelter wrote:
>> Well, with item-based the only problem is new items. New users can immediately be served by the model (although this is not well supported by the API in Mahout). For the majority of use cases I saw, it is perfectly fine to have a short delay until new items "enter" the recommender; usually this happens after a retraining in batch. You have to care for cold-start and collect some interactions anyway.
>>
>> 2013/7/18 Pat Ferrel <pat.fer...@gmail.com>
>>
>>> Yes, what Myrrix does is good.
>>>
>>> My last aside was a wish for an item-based online recommender, not only a factorized one. Ted talks about using Solr for this, which we're experimenting with alongside Myrrix. I suspect Solr works, but it does require a bit of tinkering and doesn't have quite the same set of options--no LLR similarity, for instance.
>>>
>>> On the same subject, I recently attended a workshop in Seattle for UAI 2013 where Walmart reported similar results using a factorized recommender. They had to increase the number of factors past the point where the model performed well. Along the way they saw performance increase as measured by offline precision. They eventually gave up on a factorized solution. This decision seems odd, but anyway… Both Walmart's data and ours are quite diverse. The best idea is probably to create different recommenders for separate parts of the catalog, but if you create one model on all items, our intuition is that item-based works better than factorized. Again, the caveat--no A/B tests to support this yet.
>>>
>>> Doing an online item-based recommender would quickly run into scaling problems, no? We put together the simple Mahout in-memory version and it could not really handle more than a down-sampled few months of our data. Down-sampling lost us 20% of our precision scores, so we moved to the Hadoop version. Now we have use cases for an online recommender that handles anonymous new users, and that takes the story full circle.
>>>
>>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <s...@apache.org> wrote:
>>>
>>> Hi Pat,
>>>
>>> I think we should provide simple support for recommending to anonymous users. We should have a method recommendToAnonymous() that takes a PreferenceArray as argument.
>>> For item-based recommenders, it's straightforward to compute recommendations; for user-based, you have to search through all users once; for latent factor models, you have to fold the user vector into the low-dimensional space.
>>>
>>> I think Sean already added this method in Myrrix, and I have some code for my kornakapi project (a simple weblayer for Mahout).
>>>
>>> Would such a method fit your needs?
>>>
>>> Best,
>>> Sebastian
>>>
>>> 2013/7/17 Pat Ferrel <pat.fer...@gmail.com>
>>>
>>>> May I ask how you plan to support model updates and 'anonymous' users?
>>>>
>>>> I assume the latent factor model is still calculated offline in batch mode, and then there are periodic updates? How are the updates handled? Do you plan to require batch model refactorization for any update? Or perform some partial update, maybe by just transforming new data into the LF space already in place and then doing a full refactorization every so often in batch mode?
>>>>
>>>> By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow passing in the entire history vector, or a userID plus incremental new history, to the recommender? I hope so.
>>>>
>>>> For what it's worth, we did a comparison of Mahout item-based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6 months of data. The data was purchase data from a diverse ecommerce source with a large variety of products, from electronics to clothes. We found item-based CF did far better than ALS. As we increased the number of latent factors the results got better, but they were never within 10% of item-based (we used MAP as the offline metric). Not sure why, but maybe it has to do with the diversity of the item types.
>>>>
>>>> I understand that a full item-based online recommender has very different tradeoffs, and anyway others may not have seen this disparity in results. Furthermore, we don't have A/B test results yet to validate the offline metric.
>>>>
>>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>>>>
>>>> Peng,
>>>>
>>>> This is the reason I separated out the DataModel and only put the learner stuff there. The learner I mentioned yesterday just stores the parameters, (noOfUsers + noOfItems) * noOfLatentFactors, and does not care where preferences are stored.
>>>>
>>>> I, kind of, agree with the multi-level DataModel approach: one for iterating over "all" preferences, and one for when one wants to deploy a recommender and perform a lot of top-N recommendation tasks.
>>>>
>>>> (Or one DataModel with a strategy that might reduce the existing memory consumption while still providing fast access, I am not sure. Let me try a matrix-backed DataModel approach.)
>>>>
>>>> Gokhan
>>>>
>>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <s...@apache.org> wrote:
>>>>
>>>>> I completely agree: Netflix is less than one gigabyte in a smart representation, so 12x more memory is a no-go. The techniques used in FactorizablePreferences allow a much more memory-efficient representation, tested on the KDD Music dataset, which is approx 2.5 times Netflix and fits into 3GB with that approach.
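As an aside for readers of the archive, here is a hedged sketch of what an item-based recommendToAnonymous(), along the lines Sebastian suggests above, could look like: score each candidate by a similarity-weighted sum over the anonymous user's preferences. It is written against the Mahout 0.8 Taste interfaces (PreferenceArray, ItemSimilarity, FastByIDMap are real), but it is an illustration, not the method that actually shipped.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.recommender.GenericRecommendedItem;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public final class AnonymousItemBasedRecs {

  public static List<RecommendedItem> recommendToAnonymous(
      PreferenceArray anonymousPrefs, ItemSimilarity similarity, int howMany)
      throws TasteException {

    FastByIDMap<Double> scores = new FastByIDMap<Double>();

    // Spread each preference of the anonymous user onto similar items,
    // weighted by item-item similarity.
    for (int i = 0; i < anonymousPrefs.length(); i++) {
      long ratedItemID = anonymousPrefs.getItemID(i);
      float prefValue = anonymousPrefs.getValue(i);
      for (long candidate : similarity.allSimilarItemIDs(ratedItemID)) {
        if (anonymousPrefs.hasPrefWithItemID(candidate)) {
          continue; // don't recommend what the user already has
        }
        double sim = similarity.itemSimilarity(ratedItemID, candidate);
        if (!Double.isNaN(sim)) {
          Double current = scores.get(candidate);
          scores.put(candidate, (current == null ? 0.0 : current) + sim * prefValue);
        }
      }
    }

    // Rank candidates by accumulated score, highest first.
    List<RecommendedItem> ranked = new ArrayList<RecommendedItem>();
    LongPrimitiveIterator it = scores.keySetIterator();
    while (it.hasNext()) {
      long itemID = it.nextLong();
      ranked.add(new GenericRecommendedItem(itemID, scores.get(itemID).floatValue()));
    }
    Collections.sort(ranked, new Comparator<RecommendedItem>() {
      @Override
      public int compare(RecommendedItem a, RecommendedItem b) {
        return Float.compare(b.getValue(), a.getValue());
      }
    });
    return ranked.size() > howMany ? ranked.subList(0, howMany) : ranked;
  }
}

Note that this touches no user index at all, which is why item-based is the easy case: only the precomputed item-item similarities are needed, so a brand-new user with a handful of interactions can be served immediately.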
>>>>>
>>>>> 2013/7/16 Ted Dunning <ted.dunn...@gmail.com>
>>>>>
>>>>>> Netflix is a small dataset. 12G for that seems quite excessive.
>>>>>>
>>>>>> Note also that this is before you have done any work.
>>>>>>
>>>>>> Ideally, 100 million observations should take << 1GB.
>>>>>>
>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:
>>>>>>
>>>>>>> The second idea is indeed splendid: we should separate time-complexity-first and space-complexity-first implementations. What I'm not quite sure about is whether we really need to create two interfaces instead of one. Personally, I think 12G of heap space is not that high, right? Most new laptops can already handle that (emphasis on laptop). And if we replace the hash map (the culprit of the high memory consumption) with a list/linked list, it would simply degrade a lookup to a linear search, O(n), which is not too bad either. The current DataModel is the result of careful thought and has undergone extensive testing; it is easier to expand on top of it than to subvert it.
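To put rough numbers behind the memory discussion above, here is a sketch of the kind of "smart representation" being argued for: parallel primitive arrays in a CSR-style layout instead of nested hash maps of Preference objects. The class and field names are invented (this is not the actual FactorizablePreferences code). 100M observations then cost roughly 100M * (4-byte item index + 4-byte float) ≈ 800MB plus one offset array, in line with Ted's target, and a single (user, item) lookup becomes exactly the linear scan Peng mentions.

public final class CompactPreferences {

  // userOffsets[u] .. userOffsets[u + 1] delimit user u's preferences
  // in the two parallel arrays below (CSR-style layout).
  private final int[] userOffsets;
  private final int[] itemIndices; // item index per preference, 4 bytes each
  private final float[] values;    // preference value per preference, 4 bytes each

  public CompactPreferences(int[] userOffsets, int[] itemIndices, float[] values) {
    this.userOffsets = userOffsets;
    this.itemIndices = itemIndices;
    this.values = values;
  }

  public int numPreferences() {
    return itemIndices.length;
  }

  /** Iterating one user's preferences is a contiguous, boxing-free scan,
      which is all a batch learner needs. */
  public double preferenceSum(int user) {
    double sum = 0.0;
    for (int p = userOffsets[user]; p < userOffsets[user + 1]; p++) {
      sum += values[p];
    }
    return sum;
  }

  /** A single (user, item) lookup degrades to a linear search over that
      user's preferences, instead of a hash map's O(1): the time-vs-space
      trade-off discussed in this thread. */
  public float getValue(int user, int itemIndex) {
    for (int p = userOffsets[user]; p < userOffsets[user + 1]; p++) {
      if (itemIndices[p] == itemIndex) {
        return values[p];
      }
    }
    return Float.NaN; // no preference recorded
  }
}

This is precisely the time-complexity-first vs space-complexity-first split Peng describes: sequential iteration stays fast and compact, while random access pays the linear-scan price.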