It was about 2.5M users and 500K items with 25M actions over 6 months of data.
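For a rough sense of what a dataset of that size needs in memory (the later part of this thread debates exactly this point), here is a back-of-the-envelope sketch in Java. It assumes preferences are held as parallel primitive arrays, one int userID, one int itemID, and one float value per action; that layout is illustrative, not how Mahout's GenericDataModel actually stores things:

```java
public class FootprintEstimate {
    public static void main(String[] args) {
        long actions = 25_000_000L;        // ~25M user-item actions, as above
        long bytesPerAction = 4 + 4 + 4;   // int userID + int itemID + float value
        long totalBytes = actions * bytesPerAction;
        System.out.printf("parallel-array estimate: %.2f GB%n",
                totalBytes / (1024.0 * 1024.0 * 1024.0));
        // prints "parallel-array estimate: 0.28 GB"
    }
}
```

So the raw data is well under a gigabyte in a flat layout; the multi-gigabyte heaps discussed further down come from per-preference object and hash-map overhead, not from the data itself.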
On Jul 18, 2013, at 10:15 AM, Peng Cheng <[email protected]> wrote:

If I remember right, a highlight of the 0.8 release is an online clustering algorithm. I'm not sure whether it can be used in an item-based recommender, but this is definitely something I would like to pursue. It's probably the only advantage a non-Hadoop implementation can offer in the future.

Many non-Hadoop recommenders are pretty fast, but the existing in-memory GenericDataModel and FileDataModel are largely implemented for sandboxes; IMHO they are the culprit of the scalability problem.

May I ask about the scale of your dataset? How many ratings does it have?

Yours,
Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:
> Well, with itembased the only problem is new items. New users can
> immediately be served by the model (although this is not well supported
> by the API in Mahout). For the majority of use cases I saw, it is
> perfectly fine to have a short delay until new items "enter" the
> recommender; usually this happens after a retraining in batch. You have
> to care for cold-start and collect some interactions anyway.
>
>
> 2013/7/18 Pat Ferrel <[email protected]>
>
>> Yes, what Myrrix does is good.
>>
>> My last aside was a wish for an item-based online recommender, not only
>> a factorized one. Ted talks about using Solr for this, which we're
>> experimenting with alongside Myrrix. I suspect Solr works, but it does
>> require a bit of tinkering and doesn't have quite the same set of
>> options -- no LLR similarity, for instance.
>>
>> On the same subject, I recently attended a workshop in Seattle for
>> UAI 2013 where Walmart reported similar results using a factorized
>> recommender. They had to increase the factor count past the point where
>> it would perform well, and along the way offline precision kept
>> improving. They eventually gave up on a factorized solution. This
>> decision seems odd, but anyway... Both Walmart's data set and ours are
>> quite diverse.
>> The best idea is probably to create different recommenders for separate
>> parts of the catalog, but if you create one model on all items, our
>> intuition is that item-based works better than factorized. Again the
>> caveat: no A/B tests to support this yet.
>>
>> Doing an online item-based recommender would quickly run into scaling
>> problems, no? We put together the simple Mahout in-memory version and it
>> could not really handle more than a down-sampled few months of our data.
>> Down-sampling cost us 20% of our precision scores, so we moved to the
>> Hadoop version. Now we have use cases for an online recommender that
>> handles anonymous new users, which takes the story full circle.
>>
>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <[email protected]> wrote:
>>
>> Hi Pat,
>>
>> I think we should provide simple support for recommending to anonymous
>> users. We should have a method recommendToAnonymous() that takes a
>> PreferenceArray as its argument. For item-based recommenders it's
>> straightforward to compute recommendations; for user-based you have to
>> search through all users once; for latent factor models you have to fold
>> the user vector into the low-dimensional space.
>>
>> I think Sean already added this method in Myrrix, and I have some code
>> for my kornakapi project (a simple web layer for Mahout).
>>
>> Would such a method fit your needs?
>>
>> Best,
>> Sebastian
>>
>>
>>
>> 2013/7/17 Pat Ferrel <[email protected]>
>>
>>> May I ask how you plan to support model updates and 'anonymous' users?
>>>
>>> I assume the latent factor model is still calculated offline in batch
>>> mode, with periodic updates. How are the updates handled? Do you plan
>>> to require batch refactorization of the model for any update? Or will
>>> you perform some partial update, maybe just transforming new data into
>>> the LF space already in place, and then do a full refactorization every
>>> so often in batch mode?
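The "transforming new data into the LF space" idea above can be sketched concretely. The snippet below is a toy fold-in, not Mahout or Myrrix code, and every name in it (itemFactors, foldIn, recommend) is illustrative: it approximates an anonymous user's factor vector by averaging the factor vectors of the items in their history, then ranks unseen items by dot product. A real fold-in would instead solve a regularized least-squares problem against the item factors; the averaging here is a crude stand-in to show the shape of the computation.

```java
import java.util.HashSet;
import java.util.Set;

public class AnonymousFoldIn {
    /** Crude fold-in: average the factor vectors of the items the
     *  anonymous user interacted with (illustrative, not a real solver). */
    static double[] foldIn(double[][] itemFactors, int[] history) {
        int k = itemFactors[0].length;
        double[] user = new double[k];
        for (int item : history)
            for (int f = 0; f < k; f++)
                user[f] += itemFactors[item][f] / history.length;
        return user;
    }

    /** Score every item not in the history by its dot product with the
     *  folded-in user vector; return the best-scoring item id. */
    static int recommend(double[][] itemFactors, int[] history) {
        double[] user = foldIn(itemFactors, history);
        Set<Integer> seen = new HashSet<>();
        for (int h : history) seen.add(h);
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int item = 0; item < itemFactors.length; item++) {
            if (seen.contains(item)) continue;
            double score = 0;
            for (int f = 0; f < user.length; f++)
                score += user[f] * itemFactors[item][f];
            if (score > bestScore) { bestScore = score; best = item; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy factors: items 0 and 2 point the same way, item 1 differs.
        double[][] itemFactors = { {1.0, 0.0}, {0.0, 1.0}, {0.9, 0.1} };
        System.out.println(recommend(itemFactors, new int[] {0}));
        // prints 2 -- the nearest unseen item
    }
}
```

The point of the sketch is that no refactorization is needed to serve the new history: the item factors stay fixed, and only a k-dimensional user vector is computed per request.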
>>>
>>> By 'anonymous users' I mean users with some history that is not yet
>>> incorporated in the LF model. This could be the history of a new user
>>> asked to pick a few items to start the rec process, or of an old user
>>> with some new action history not yet in the model. Are you going to
>>> allow passing either the entire history vector or userID + incremental
>>> new history to the recommender? I hope so.
>>>
>>> For what it's worth, we compared Mahout item-based CF to Mahout ALS-WR
>>> CF on 2.5M users and 500K items with many M actions over 6 months of
>>> data. The data was purchase data from a diverse ecom source with a
>>> large variety of products, from electronics to clothes. We found
>>> item-based CF did far better than ALS. As we increased the number of
>>> latent factors the results got better, but they were never within 10%
>>> of item-based (we used MAP as the offline metric). Not sure why, but
>>> maybe it has to do with the diversity of the item types.
>>>
>>> I understand that a full item-based online recommender has very
>>> different tradeoffs, and anyway others may not have seen this disparity
>>> of results. Furthermore, we don't have A/B test results yet to validate
>>> the offline metric.
>>>
>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <[email protected]> wrote:
>>>
>>> Peng,
>>>
>>> This is the reason I separated out the DataModel and only put the
>>> learner stuff there. The learner I mentioned yesterday just stores the
>>> parameters, (noOfUsers + noOfItems) * noOfLatentFactors, and does not
>>> care where preferences are stored.
>>>
>>> I kind of agree with the multi-level DataModel approach: one for
>>> iterating over "all" preferences, and one for when you want to deploy a
>>> recommender and perform a lot of top-N recommendation tasks.
>>>
>>> (Or one DataModel with a strategy that might reduce existing memory
>>> consumption while still providing fast access; I am not sure.
>>> Let me try a matrix-backed DataModel approach.)
>>>
>>> Gokhan
>>>
>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <[email protected]> wrote:
>>>
>>>> I completely agree. Netflix is less than one gigabyte in a smart
>>>> representation; 12x more memory is a no-go. The techniques used in
>>>> FactorizablePreferences allow a much more memory-efficient
>>>> representation, tested on the KDD Music dataset, which is approx 2.5
>>>> times Netflix and fits into 3GB with that approach.
>>>>
>>>>
>>>> 2013/7/16 Ted Dunning <[email protected]>
>>>>
>>>>> Netflix is a small dataset. 12GB for that seems quite excessive.
>>>>>
>>>>> Note also that this is before you have done any work.
>>>>>
>>>>> Ideally, 100 million observations should take << 1GB.
>>>>>
>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <[email protected]> wrote:
>>>>>
>>>>>> The second idea is indeed splendid; we should separate the
>>>>>> time-complexity-first and the space-complexity-first implementations.
>>>>>> What I'm not quite sure about is whether we really need to create
>>>>>> two interfaces instead of one. Personally, I think 12GB of heap
>>>>>> space is not that high, right? Most new laptops can already handle
>>>>>> that (emphasis on laptop). And if we replace the hash map (the
>>>>>> culprit of the high memory consumption) with a list/linked list, it
>>>>>> would simply degrade lookups to a linear search, O(n) -- not too bad
>>>>>> either. The current DataModel is the result of careful thought and
>>>>>> has undergone extensive testing; it is easier to expand on top of it
>>>>>> than to subvert it.
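As an aside on the hash-map-versus-list trade-off in the last message: a third option is a sorted parallel-array layout, loosely in the spirit of the memory-efficient representation mentioned earlier in the thread. It keeps memory near the raw data size while still giving O(log n) lookup via binary search, instead of the O(n) scan a linked list would give. A minimal sketch, illustrative only and not Mahout's actual DataModel interface:

```java
import java.util.Arrays;

/** Sketch of a compact per-user preference store: item ids kept in a
 *  sorted int[] alongside a float[] of values, so a lookup is a binary
 *  search (O(log n)) with no per-entry object overhead.
 *  Illustrative only -- not Mahout's DataModel interface. */
public class CompactPreferences {
    private final int[] itemIds;    // sorted ascending
    private final float[] values;   // values[i] belongs to itemIds[i]

    CompactPreferences(int[] sortedItemIds, float[] values) {
        this.itemIds = sortedItemIds;
        this.values = values;
    }

    /** Returns the preference value, or NaN if the item is unrated. */
    float get(int itemId) {
        int i = Arrays.binarySearch(itemIds, itemId);
        return i >= 0 ? values[i] : Float.NaN;
    }

    public static void main(String[] args) {
        CompactPreferences prefs = new CompactPreferences(
                new int[] {3, 17, 42}, new float[] {1f, 5f, 3f});
        System.out.println(prefs.get(17));  // prints 5.0
        System.out.println(prefs.get(8));   // prints NaN
    }
}
```

The trade-off is that inserting a new preference requires shifting array contents, so a layout like this favors the read-heavy, deployed-recommender case over the online-update case.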
