It is 2 SparseRowMatrices, Peng. But I don't want to comment on it before actually trying it. This is essentially a first step for me to choose my side in the DataModel implementation discussion :)
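(For concreteness, here is a rough sketch of what such a matrix-backed preference store could look like -- hypothetical names only, not the actual Mahout or SparseRowMatrix code. The point is the memory layout: primitive parallel arrays per row instead of boxed hash-map entries.)

```java
// Sketch: one user's preferences stored as two parallel primitive arrays
// (sorted item ids + float values). At roughly 8 bytes per rating
// (4-byte id + 4-byte value), 100M ratings is on the order of 800MB raw,
// versus the tens of bytes per entry a boxed HashMap costs.
class SparsePreferenceRows {
    final int[][] itemIds;   // itemIds[u] = item ids rated by user u, sorted ascending
    final float[][] values;  // values[u][j] = rating for itemIds[u][j]

    SparsePreferenceRows(int[][] itemIds, float[][] values) {
        this.itemIds = itemIds;
        this.values = values;
    }

    // Binary search within one user's row: O(log n_u) per lookup instead of
    // a hash lookup, trading a little time for a much smaller footprint.
    Float getPreference(int user, int item) {
        int[] ids = itemIds[user];
        int lo = 0, hi = ids.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (ids[mid] == item) return values[user][mid];
            if (ids[mid] < item) lo = mid + 1; else hi = mid - 1;
        }
        return null;  // no preference recorded
    }

    // Back-of-the-envelope size: ids + values, ignoring object headers.
    long approxBytes() {
        long n = 0;
        for (int[] row : itemIds) n += row.length;
        return n * 8L;
    }
}
```

A second matrix with the same layout transposed (item rows, user columns) gives fast access in both directions, which is presumably why two SparseRowMatrices are needed.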
Gokhan

On Fri, Jul 19, 2013 at 2:25 AM, Peng Cheng <[email protected]> wrote:
> Wow, that's lightning fast.
>
> Is it a SparseMatrix or a DenseMatrix?
>
> On 13-07-18 07:23 PM, Gokhan Capan wrote:
>> I just started to implement a Matrix-backed data model and pushed it, to
>> check the performance and memory considerations.
>>
>> I believe I can try it on some data tomorrow.
>>
>> Best
>>
>> Gokhan
>>
>> On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng <[email protected]> wrote:
>>
>>> I see, sorry, I was too presumptuous. I have only recently worked with and
>>> tested SVDRecommender, so I could never have known the efficiency of an
>>> item-based recommender. Maybe there is room for algorithmic optimization.
>>>
>>> The online recommender Gokhan is working on is also an SVDRecommender. An
>>> online user-based or item-based recommender based on a clustering technique
>>> would definitely be critical, but we need an expert to volunteer :)
>>>
>>> Perhaps Dr. Dunning can say a few words? He announced the online
>>> clustering component.
>>>
>>> Yours, Peng
>>>
>>> On 13-07-18 03:54 PM, Pat Ferrel wrote:
>>>
>>>> No, it was CPU bound, not memory. I gave it something like a 14G heap. It
>>>> was running, just too slow to be of any real use. We switched to the
>>>> Hadoop version and stored precalculated recs in a db for every user.
>>>>
>>>> On Jul 18, 2013, at 12:06 PM, Peng Cheng <[email protected]> wrote:
>>>>
>>>> Strange, that's just a little bit larger than the libimseti dataset (17M
>>>> ratings). Did you encounter an OutOfMemory or GC-overhead-limit
>>>> exception? Allocating more heap space usually helps.
>>>>
>>>> Yours, Peng
>>>>
>>>> On 13-07-18 02:27 PM, Pat Ferrel wrote:
>>>>
>>>>> It was about 2.5M users and 500K items with 25M actions over 6 months
>>>>> of data.
>>>>>
>>>>> On Jul 18, 2013, at 10:15 AM, Peng Cheng <[email protected]> wrote:
>>>>>
>>>>> If I remember right, a highlight of the 0.8 release is an online
>>>>> clustering algorithm.
>>>>> I'm not sure if it can be used in an item-based recommender, but this is
>>>>> definitely something I would like to pursue. It's probably the only
>>>>> advantage a non-Hadoop implementation can offer in the future.
>>>>>
>>>>> Many non-Hadoop recommenders are pretty fast. But the existing in-memory
>>>>> GenericDataModel and FileDataModel are largely implemented for
>>>>> sandboxes; IMHO they are the culprit of the scalability problem.
>>>>>
>>>>> May I ask about the scale of your dataset? How many ratings does it
>>>>> have?
>>>>>
>>>>> Yours, Peng
>>>>>
>>>>> On 13-07-18 12:14 PM, Sebastian Schelter wrote:
>>>>>
>>>>>> Well, with itembased the only problem is new items. New users can
>>>>>> immediately be served by the model (although this is not well supported
>>>>>> by the API in Mahout). For the majority of use cases I saw, it is
>>>>>> perfectly fine to have a short delay until new items "enter" the
>>>>>> recommender; usually this happens after a retraining in batch. You have
>>>>>> to care for cold-start and collect some interactions anyway.
>>>>>>
>>>>>> 2013/7/18 Pat Ferrel <[email protected]>
>>>>>>
>>>>>>> Yes, what Myrrix does is good.
>>>>>>>
>>>>>>> My last aside was a wish for an item-based online recommender, not
>>>>>>> only a factorized one. Ted talks about using Solr for this, which
>>>>>>> we're experimenting with alongside Myrrix. I suspect Solr works, but
>>>>>>> it does require a bit of tinkering and doesn't have quite the same set
>>>>>>> of options--no LLR similarity, for instance.
>>>>>>>
>>>>>>> On the same subject, I recently attended a workshop in Seattle for
>>>>>>> UAI 2013 where Walmart reported similar results using a factorized
>>>>>>> recommender. They had to increase the factor number past where it
>>>>>>> would perform well. Along the way they saw increasing performance
>>>>>>> measuring precision offline.
>>>>>>> They eventually gave up on a factorized solution. This decision seems
>>>>>>> odd, but anyway… In the case of Walmart and our data set, both are
>>>>>>> quite diverse. The best idea is probably to create different
>>>>>>> recommenders for separate parts of the catalog, but if you create one
>>>>>>> model on all items, our intuition is that item-based works better than
>>>>>>> factorized. Again, caveat--no A/B tests to support this yet.
>>>>>>>
>>>>>>> Doing an online item-based recommender would quickly run into scaling
>>>>>>> problems, no? We put together the simple Mahout in-memory version and
>>>>>>> it could not really handle more than a down-sampled few months of our
>>>>>>> data. Down-sampling lost us 20% of our precision scores, so we moved
>>>>>>> to the Hadoop version. Now we have use cases for an online recommender
>>>>>>> that handles anonymous new users, and that takes the story full
>>>>>>> circle.
>>>>>>>
>>>>>>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Pat,
>>>>>>>
>>>>>>> I think we should provide simple support for recommending to anonymous
>>>>>>> users. We should have a method recommendToAnonymous() that takes a
>>>>>>> PreferenceArray as argument. For itembased recommenders, it's
>>>>>>> straightforward to compute recommendations; for userbased, you have to
>>>>>>> search through all users once; for latent factor models, you have to
>>>>>>> fold the user vector into the low-dimensional space.
>>>>>>>
>>>>>>> I think Sean already added this method in Myrrix, and I have some code
>>>>>>> for my kornakapi project (a simple web layer for Mahout).
>>>>>>>
>>>>>>> Would such a method fit your needs?
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>> 2013/7/17 Pat Ferrel <[email protected]>
>>>>>>>
>>>>>>> May I ask how you plan to support model updates and 'anonymous' users?
>>>>>>>> I assume the latent factors model is still calculated offline in
>>>>>>>> batch mode, and then there are periodic updates? How are the updates
>>>>>>>> handled? Do you plan to require batch model refactorization for any
>>>>>>>> update? Or perform some partial update, maybe by just transforming
>>>>>>>> new data into the LF space already in place, then doing full
>>>>>>>> refactorization every so often in batch mode?
>>>>>>>>
>>>>>>>> By 'anonymous users' I mean users with some history that is not yet
>>>>>>>> incorporated in the LF model. This could be history from a new user
>>>>>>>> asked to pick a few items to start the rec process, or an old user
>>>>>>>> with some new action history not yet in the model. Are you going to
>>>>>>>> allow passing the entire history vector, or userID + incremental new
>>>>>>>> history, to the recommender? I hope so.
>>>>>>>>
>>>>>>>> For what it's worth, we did a comparison of Mahout item-based CF to
>>>>>>>> Mahout ALS-WR CF on 2.5M users and 500K items with many millions of
>>>>>>>> actions over 6 months of data. The data was purchase data from a
>>>>>>>> diverse ecom source with a large variety of products, from
>>>>>>>> electronics to clothes. We found item-based CF did far better than
>>>>>>>> ALS. As we increased the number of latent factors, the results got
>>>>>>>> better but were never within 10% of item-based (we used MAP as the
>>>>>>>> offline metric). Not sure why, but maybe it has to do with the
>>>>>>>> diversity of the item types.
>>>>>>>> I understand that a full item-based online recommender has very
>>>>>>>> different tradeoffs, and anyway others may not have seen this
>>>>>>>> disparity of results. Furthermore, we don't have A/B test results yet
>>>>>>>> to validate the offline metric.
>>>>>>>>
>>>>>>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Peng,
>>>>>>>>
>>>>>>>> This is the reason I separated out the DataModel, and only put the
>>>>>>>> learner stuff there. The learner I mentioned yesterday just stores
>>>>>>>> the parameters, (noOfUsers+noOfItems)*noOfLatentFactors of them, and
>>>>>>>> does not care where preferences are stored.
>>>>>>>>
>>>>>>>> I kind of agree with the multi-level DataModel approach:
>>>>>>>> one for iterating over "all" preferences, and one for if one wants to
>>>>>>>> deploy a recommender and perform a lot of top-N recommendation tasks.
>>>>>>>>
>>>>>>>> (Or one DataModel with a strategy that might reduce the existing
>>>>>>>> memory consumption while still providing fast access, I am not sure.
>>>>>>>> Let me try a matrix-backed DataModel approach.)
>>>>>>>>
>>>>>>>> Gokhan
>>>>>>>>
>>>>>>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I completely agree. Netflix is less than one gigabyte in a smart
>>>>>>>>> representation; 12x more memory is a no-go. The techniques used in
>>>>>>>>> FactorizablePreferences allow a much more memory-efficient
>>>>>>>>> representation, tested on the KDD Music dataset, which is approx 2.5
>>>>>>>>> times Netflix and fits into 3GB with that approach.
>>>>>>>>>
>>>>>>>>> 2013/7/16 Ted Dunning <[email protected]>
>>>>>>>>>
>>>>>>>>>> Netflix is a small dataset.
>>>>>>>>>> 12G for that seems quite excessive.
>>>>>>>>>>
>>>>>>>>>> Note also that this is before you have done any work.
>>>>>>>>>>
>>>>>>>>>> Ideally, 100 million observations should take << 1GB.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> The second idea is indeed splendid; we should separate the
>>>>>>>>>>> time-complexity-first and space-complexity-first implementations.
>>>>>>>>>>> What I'm not quite sure about is whether we really need to create
>>>>>>>>>>> two interfaces instead of one.
>>>>>>>>>>>
>>>>>>>>>>> Personally, I think 12G of heap space is not that high, right?
>>>>>>>>>>> Most new laptops can already handle that (emphasis on laptop). And
>>>>>>>>>>> if we replace the hash map (the culprit of the high memory
>>>>>>>>>>> consumption) with a list/linked list, it would simply degrade the
>>>>>>>>>>> time complexity of a lookup to an O(n) linear search, not too bad
>>>>>>>>>>> either. The current DataModel is a result of careful thought and
>>>>>>>>>>> has undergone extensive testing; it is easier to expand on top of
>>>>>>>>>>> it than to subvert it.
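The fold-in that Sebastian describes for serving anonymous users with a latent factor model can be sketched as a small ridge regression against the fixed item factors: solve (VsᵀVs + λI)u = Vsᵀr for the new user's vector u from their rated items S, then score unseen items by the dot product. The following is a hedged illustration of that idea, not code from Mahout, Myrrix, or kornakapi; all names are made up.

```java
// Fold an anonymous user's history into an already-trained latent factor
// model. itemFactors comes from batch training and stays fixed; only a tiny
// k x k system is solved per request, so this is cheap enough to do online.
class FoldIn {

    // Solve the k x k linear system M x = b by Gauss-Jordan elimination
    // with partial pivoting (fine for the small k used in practice).
    static double[] solve(double[][] M, double[] b) {
        int n = b.length;
        double[][] m = new double[n][n + 1];
        for (int i = 0; i < n; i++) {
            System.arraycopy(M[i], 0, m[i], 0, n);
            m[i][n] = b[i];
        }
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(m[r][col]) > Math.abs(m[piv][col])) piv = r;
            double[] tmp = m[piv]; m[piv] = m[col]; m[col] = tmp;
            for (int r = 0; r < n; r++) {
                if (r == col) continue;
                double f = m[r][col] / m[col][col];
                for (int c = col; c <= n; c++) m[r][c] -= f * m[col][c];
            }
        }
        double[] x = new double[n];
        for (int i = 0; i < n; i++) x[i] = m[i][n] / m[i][i];
        return x;
    }

    // Compute the anonymous user's factor vector from the items they rated:
    // u = (Vs^T Vs + lambda*I)^-1 Vs^T r, where Vs stacks the rated items'
    // factor vectors and r holds the corresponding ratings.
    static double[] foldInUser(double[][] itemFactors, int[] ratedItems,
                               double[] ratings, double lambda) {
        int k = itemFactors[0].length;
        double[][] gram = new double[k][k];
        double[] rhs = new double[k];
        for (int j = 0; j < ratedItems.length; j++) {
            double[] v = itemFactors[ratedItems[j]];
            for (int a = 0; a < k; a++) {
                rhs[a] += v[a] * ratings[j];
                for (int b = 0; b < k; b++) gram[a][b] += v[a] * v[b];
            }
        }
        for (int a = 0; a < k; a++) gram[a][a] += lambda;
        return solve(gram, rhs);
    }

    // Predicted preference: dot product of user and item factor vectors.
    static double score(double[] user, double[] item) {
        double s = 0;
        for (int a = 0; a < user.length; a++) s += user[a] * item[a];
        return s;
    }
}
```

A recommendToAnonymous() built this way only needs the user's raw history per call, which also covers Pat's case of a known user with new actions not yet in the model: pass the full history vector and fold it in fresh each time.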
