I just started implementing a Matrix-backed data model and pushed it, so that its performance and memory use can be checked.
I believe I can try it on some data tomorrow.

Best

Gokhan


On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng <[email protected]> wrote:

> I see, sorry I was too presumptuous. I only recently worked and tested
> SVDRecommender, never could have known its efficiency using an item-based
> recommender. Maybe there is space for algorithmic optimization.
>
> The online recommender Gokhan is working on is also an SVDRecommender. An
> online user-based or item-based recommender based on clustering technique
> would definitely be critical, but we need an expert to volunteer :)
>
> Perhaps Dr Dunning can have a few words? He announced the online
> clustering component.
>
> Yours Peng
>
>
> On 13-07-18 03:54 PM, Pat Ferrel wrote:
>
>> No it was CPU bound not memory. I gave it something like 14G heap. It was
>> running, just too slow to be of any real use. We switched to the hadoop
>> version and stored precalculated recs in a db for every user.
>>
>> On Jul 18, 2013, at 12:06 PM, Peng Cheng <[email protected]> wrote:
>>
>> Strange, it's just a little bit larger than the libimseti dataset (17m
>> ratings), did you encounter an outOfMemory or GCTimeOut exception?
>> Allocating more heap space usually helps.
>>
>> Yours Peng
>>
>> On 13-07-18 02:27 PM, Pat Ferrel wrote:
>>
>>> It was about 2.5M users and 500K items with 25M actions over 6 months of
>>> data.
>>>
>>> On Jul 18, 2013, at 10:15 AM, Peng Cheng <[email protected]> wrote:
>>>
>>> If I remember right, a highlight of 0.8 release is an online clustering
>>> algorithm. I'm not sure if it can be used in item-based recommender, but
>>> this is definitely something I would like to pursue. It's probably the
>>> only advantage a non-hadoop implementation can offer in the future.
>>>
>>> Many non-hadoop recommenders are pretty fast. But existing in-memory
>>> GenericDataModel and FileDataModel are largely implemented for sandboxes,
>>> IMHO they are the culprit of scalability problem.
>>>
>>> May I ask about the scale of your dataset?
>>> how many ratings does it have?
>>>
>>> Yours Peng
>>>
>>> On 13-07-18 12:14 PM, Sebastian Schelter wrote:
>>>
>>>> Well, with itembased the only problem is new items. New users can
>>>> immediately be served by the model (although this is not well supported
>>>> by the API in Mahout). For the majority of usecases I saw, it is
>>>> perfectly fine to have a short delay until new items "enter" the
>>>> recommender, usually this happens after a retraining in batch. You have
>>>> to care for cold-start and collect some interactions anyway.
>>>>
>>>>
>>>> 2013/7/18 Pat Ferrel <[email protected]>
>>>>
>>>>> Yes, what Myrrix does is good.
>>>>>
>>>>> My last aside was a wish for an item-based online recommender not only
>>>>> factorized. Ted talks about using Solr for this, which we're
>>>>> experimenting with alongside Myrrix. I suspect Solr works but it does
>>>>> require a bit of tinkering and doesn't have quite the same set of
>>>>> options--no llr similarity for instance.
>>>>>
>>>>> On the same subject I recently attended a workshop in Seattle for
>>>>> UAI2013 where Walmart reported similar results using a factorized
>>>>> recommender. They had to increase the factor number past where it would
>>>>> perform well. Along the way they saw increasing performance measuring
>>>>> precision offline. They eventually gave up on a factorized solution.
>>>>> This decision seems odd but anyway… In the case of Walmart and our data
>>>>> set they are quite diverse. The best idea is probably to create
>>>>> different recommenders for separate parts of the catalog but if you
>>>>> create one model on all items our intuition is that item-based works
>>>>> better than factorized. Again caveat--no A/B tests to support this yet.
>>>>>
>>>>> Doing an online item-based recommender would quickly run into scaling
>>>>> problems, no?
>>>>> We put together the simple Mahout in-memory version and it
>>>>> could not really handle more than a down-sampled few months of our
>>>>> data. Down-sampling lost us 20% of our precision scores so we moved to
>>>>> the hadoop version. Now we have use-cases for an online recommender
>>>>> that handles anonymous new users and that takes the story full circle.
>>>>>
>>>>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi Pat
>>>>>
>>>>> I think we should provide a simple support for recommending to
>>>>> anonymous users. We should have a method recommendToAnonymous() that
>>>>> takes a PreferenceArray as argument. For itembased recommenders, it's
>>>>> straightforward to compute recommendations, for userbased you have to
>>>>> search through all users once, for latent factor models, you have to
>>>>> fold the user vector into the low dimensional space.
>>>>>
>>>>> I think Sean already added this method in myrrix and I have some code
>>>>> for my kornakapi project (a simple weblayer for mahout).
>>>>>
>>>>> Would such a method fit your needs?
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>
>>>>>
>>>>> 2013/7/17 Pat Ferrel <[email protected]>
>>>>>
>>>>>> May I ask how you plan to support model updates and 'anonymous' users?
>>>>>>
>>>>>> I assume the latent factors model is calculated offline still in batch
>>>>>> mode, then there are periodic updates? How are the updates handled? Do
>>>>>> you plan to require batch model refactorization for any update? Or
>>>>>> perform some partial update by maybe just transforming new data into
>>>>>> the LF space already in place then doing full refactorization every so
>>>>>> often in batch mode?
>>>>>>
>>>>>> By 'anonymous users' I mean users with some history that is not yet
>>>>>> incorporated in the LF model.
>>>>>> This could be history from a new user asked
>>>>>> to pick a few items to start the rec process, or an old user with some
>>>>>> new action history not yet in the model. Are you going to allow for
>>>>>> passing the entire history vector or userID+incremental new history to
>>>>>> the recommender? I hope so.
>>>>>>
>>>>>> For what it's worth we did a comparison of Mahout Item based CF to
>>>>>> Mahout ALS-WR CF on 2.5M users and 500K items with many M actions over
>>>>>> 6 months of data. The data was purchase data from a diverse ecom
>>>>>> source with a large variety of products from electronics to clothes.
>>>>>> We found Item based CF did far better than ALS. As we increased the
>>>>>> number of latent factors the results got better but were never within
>>>>>> 10% of item based (we used MAP as the offline metric). Not sure why
>>>>>> but maybe it has to do with the diversity of the item types.
>>>>>>
>>>>>> I understand that a full item based online recommender has very
>>>>>> different tradeoffs and anyway others may not have seen this disparity
>>>>>> of results. Furthermore we don't have A/B test results yet to validate
>>>>>> the offline metric.
>>>>>>
>>>>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <[email protected]> wrote:
>>>>>>
>>>>>> Peng,
>>>>>>
>>>>>> This is the reason I separated out the DataModel, and only put the
>>>>>> learner stuff there. The learner I mentioned yesterday just stores the
>>>>>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>>>>>> where preferences are stored.
>>>>>>
>>>>>> I, kind of, agree with the multi-level DataModel approach:
>>>>>> One for iterating over "all" preferences, one for if one wants to
>>>>>> deploy a recommender and perform a lot of top-N recommendation tasks.
>>>>>>
>>>>>> (Or one DataModel with a strategy that might reduce existing memory
>>>>>> consumption, while still providing fast access, I am not sure. Let me
>>>>>> try a matrix-backed DataModel approach)
>>>>>>
>>>>>> Gokhan
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>>>>> representation, 12x more memory is a nogo. The techniques used in
>>>>>>> FactorizablePreferences allow a much more memory efficient
>>>>>>> representation, tested on KDD Music dataset which is approx 2.5 times
>>>>>>> Netflix and fits into 3GB with that approach.
>>>>>>>
>>>>>>>
>>>>>>> 2013/7/16 Ted Dunning <[email protected]>
>>>>>>>
>>>>>>>> Netflix is a small dataset. 12G for that seems quite excessive.
>>>>>>>>
>>>>>>>> Note also that this is before you have done any work.
>>>>>>>>
>>>>>>>> Ideally, 100 million observations should take << 1GB.
>>>>>>>>
>>>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The second idea is indeed splendid, we should separate
>>>>>>>>> time-complexity first and space-complexity first implementation.
>>>>>>>>> What I'm not quite sure, is that if we really need to create two
>>>>>>>>> interfaces instead of one. Personally, I think 12G heap space is
>>>>>>>>> not that high right? Most new laptops can already handle that
>>>>>>>>> (emphasis on laptop).
>>>>>>>>> And if we replace hash map
>>>>>>>>> (the culprit of high memory consumption) with list/linkedList, it
>>>>>>>>> would simply degrade time complexity for a linear search to O(n),
>>>>>>>>> not too bad either. The current DataModel is a result of careful
>>>>>>>>> thoughts and has undergone extensive testing, it is easier to
>>>>>>>>> expand on top of it instead of subverting it.
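
On the hash map versus list trade-off at the bottom of the thread, there may be a middle ground: keep each user's item IDs in a sorted primitive array and look them up with binary search, which avoids per-entry object overhead while keeping lookups at O(log k) for a user with k preferences. The following is only a minimal sketch. The class CompactPreferences and all of its methods are hypothetical names for illustration; it is not part of Mahout, not the FactorizablePreferences code, and not the matrix-backed data model mentioned at the top. It also assumes user IDs are dense integers 0..numUsers-1.

```java
import java.util.Arrays;

/** Hypothetical CSR-style preference store (illustration only, not a Mahout class). */
final class CompactPreferences {
    private final int[] rowStart; // entries of user u live in [rowStart[u], rowStart[u + 1])
    private final int[] itemIds;  // item IDs, sorted ascending within each user's slice
    private final float[] values; // preference values, parallel to itemIds

    private CompactPreferences(int[] rowStart, int[] itemIds, float[] values) {
        this.rowStart = rowStart;
        this.itemIds = itemIds;
        this.values = values;
    }

    /** Builds the store from parallel (user, item, preference) arrays. */
    static CompactPreferences fromTriples(int numUsers, int[] users, int[] items, float[] prefs) {
        int n = users.length;
        // Count preferences per user, then turn the counts into slice offsets.
        int[] rowStart = new int[numUsers + 1];
        for (int u : users) {
            rowStart[u + 1]++;
        }
        for (int u = 0; u < numUsers; u++) {
            rowStart[u + 1] += rowStart[u];
        }
        // Scatter every triple into its user's slice.
        int[] next = Arrays.copyOf(rowStart, numUsers);
        int[] itemIds = new int[n];
        float[] values = new float[n];
        for (int k = 0; k < n; k++) {
            int pos = next[users[k]]++;
            itemIds[pos] = items[k];
            values[pos] = prefs[k];
        }
        // Insertion-sort each slice by item ID, moving the values in step.
        for (int u = 0; u < numUsers; u++) {
            for (int i = rowStart[u] + 1; i < rowStart[u + 1]; i++) {
                int item = itemIds[i];
                float value = values[i];
                int j = i - 1;
                while (j >= rowStart[u] && itemIds[j] > item) {
                    itemIds[j + 1] = itemIds[j];
                    values[j + 1] = values[j];
                    j--;
                }
                itemIds[j + 1] = item;
                values[j + 1] = value;
            }
        }
        return new CompactPreferences(rowStart, itemIds, values);
    }

    /** Returns the preference of user for item, or NaN if there is none. */
    float get(int user, int item) {
        int idx = Arrays.binarySearch(itemIds, rowStart[user], rowStart[user + 1], item);
        return idx >= 0 ? values[idx] : Float.NaN;
    }

    /** Number of preferences stored for the given user. */
    int numPreferences(int user) {
        return rowStart[user + 1] - rowStart[user];
    }

    /** Bytes held in the three arrays, ignoring array headers. */
    long payloadBytes() {
        return 4L * (rowStart.length + itemIds.length + values.length);
    }
}
```

With this layout each preference costs about 8 bytes (a 4-byte item ID plus a 4-byte float), so 100 million observations would need roughly 800 MB of payload. Getting well below 1 GB would take further tricks, such as storing ratings in a single byte.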
