I just started implementing a matrix-backed data model and pushed it, to
check the performance and memory considerations.
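
A rough sketch of the layout I have in mind (class and method names are placeholders, not the pushed code):

```java
// Rough sketch (not the actual implementation): preferences stored in
// compressed sparse row (CSR) arrays instead of per-user hash maps, so each
// stored preference costs only a few primitive bytes.
public class CsrPreferences {
    private final int[] rowOffsets;   // user u's prefs live in [rowOffsets[u], rowOffsets[u+1])
    private final int[] itemIndices;  // item index per stored preference, sorted within each row
    private final float[] values;     // rating per stored preference

    public CsrPreferences(int[] rowOffsets, int[] itemIndices, float[] values) {
        this.rowOffsets = rowOffsets;
        this.itemIndices = itemIndices;
        this.values = values;
    }

    // Binary search within the user's row; NaN if no preference is stored.
    public float get(int user, int item) {
        int lo = rowOffsets[user];
        int hi = rowOffsets[user + 1] - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (itemIndices[mid] == item) {
                return values[mid];
            } else if (itemIndices[mid] < item) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return Float.NaN;
    }
}
```

Lookups within a user's row stay a binary search over a sorted slice, so access remains fast while the per-preference overhead of hash map entries and boxed objects goes away.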

I believe I can try it on some data tomorrow.

Best

Gokhan


On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng <[email protected]> wrote:

> I see; sorry, I was too presumptuous. I have only recently worked with and
> tested SVDRecommender, and could never have known its efficiency relative
> to an item-based recommender. Maybe there is room for algorithmic
> optimization.
>
> The online recommender Gokhan is working on is also an SVDRecommender. An
> online user-based or item-based recommender based on a clustering
> technique would definitely be important, but we need an expert to
> volunteer :)
>
> Perhaps Dr. Dunning could say a few words? He announced the online
> clustering component.
>
> Yours Peng
>
>
> On 13-07-18 03:54 PM, Pat Ferrel wrote:
>
>> No, it was CPU-bound, not memory-bound. I gave it something like a 14G
>> heap. It was running, just too slow to be of any real use. We switched to
>> the Hadoop version and stored precalculated recs in a DB for every user.
>>
>> On Jul 18, 2013, at 12:06 PM, Peng Cheng <[email protected]> wrote:
>>
>> Strange, it's just a little larger than the Libimseti dataset (17M
>> ratings). Did you encounter an OutOfMemory or GC-timeout exception?
>> Allocating more heap space usually helps.
>>
>> Yours Peng
>>
>> On 13-07-18 02:27 PM, Pat Ferrel wrote:
>>
>>> It was about 2.5M users and 500K items with 25M actions over 6 months of
>>> data.
>>>
>>> On Jul 18, 2013, at 10:15 AM, Peng Cheng <[email protected]> wrote:
>>>
>>> If I remember right, a highlight of the 0.8 release is an online
>>> clustering algorithm. I'm not sure whether it can be used in an
>>> item-based recommender, but this is definitely something I would like to
>>> pursue. It's probably the only advantage a non-Hadoop implementation can
>>> offer in the future.
>>>
>>> Many non-Hadoop recommenders are pretty fast. But the existing in-memory
>>> GenericDataModel and FileDataModel are largely implemented for sandboxes;
>>> IMHO they are the culprit of the scalability problem.
>>>
>>> May I ask about the scale of your dataset? How many ratings does it have?
>>>
>>> Yours Peng
>>>
>>> On 13-07-18 12:14 PM, Sebastian Schelter wrote:
>>>
>>>> Well, with item-based the only problem is new items. New users can
>>>> immediately be served by the model (although this is not well supported
>>>> by the API in Mahout). For the majority of use cases I have seen, it is
>>>> perfectly fine to have a short delay until new items "enter" the
>>>> recommender; usually this happens after a retraining in batch. You have
>>>> to handle cold-start and collect some interactions anyway.
>>>>
>>>>
>>>> 2013/7/18 Pat Ferrel <[email protected]>
>>>>
>>>>> Yes, what Myrrix does is good.
>>>>>
>>>>> My last aside was a wish for an item-based online recommender, not only
>>>>> a factorized one. Ted talks about using Solr for this, which we're
>>>>> experimenting with alongside Myrrix. I suspect Solr works, but it does
>>>>> require a bit of tinkering and doesn't have quite the same set of
>>>>> options--no LLR similarity, for instance.
>>>>>
>>>>> On the same subject, I recently attended a workshop in Seattle for UAI
>>>>> 2013 where Walmart reported similar results using a factorized
>>>>> recommender. They had to increase the factor number past where it would
>>>>> perform well. Along the way they saw increasing performance measuring
>>>>> precision offline. They eventually gave up on a factorized solution.
>>>>> This decision seems odd, but anyway… Both Walmart's data set and ours
>>>>> are quite diverse. The best idea is probably to create different
>>>>> recommenders for separate parts of the catalog, but if you create one
>>>>> model on all items, our intuition is that item-based works better than
>>>>> factorized. Again, a caveat--no A/B tests to support this yet.
>>>>>
>>>>> Doing an online item-based recommender would quickly run into scaling
>>>>> problems, no? We put together the simple Mahout in-memory version and
>>>>> it could not really handle more than a down-sampled few months of our
>>>>> data. Down-sampling lost us 20% of our precision scores, so we moved to
>>>>> the Hadoop version. Now we have use cases for an online recommender
>>>>> that handles anonymous new users, and that takes the story full circle.
>>>>>
>>>>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi Pat
>>>>>
>>>>> I think we should provide simple support for recommending to anonymous
>>>>> users. We should have a method recommendToAnonymous() that takes a
>>>>> PreferenceArray as an argument. For item-based recommenders, it's
>>>>> straightforward to compute recommendations; for user-based, you have to
>>>>> search through all users once; for latent factor models, you have to
>>>>> fold the user vector into the low-dimensional space.
>>>>>
>>>>> I think Sean already added this method in Myrrix, and I have some code
>>>>> for my kornakapi project (a simple web layer for Mahout).
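
The latent factor case could be sketched roughly as follows (a plain projection onto the item factors rather than a proper least-squares fold-in; all names are placeholders, not Mahout's API):

```java
import java.util.Map;

// Rough sketch: serve an anonymous user from a latent factor model by
// projecting the user's preferences onto the item factor vectors. A proper
// fold-in would solve a small least-squares problem instead; this is the
// simplest possible approximation.
public class AnonymousFoldIn {

    // itemFactors maps itemID -> latent vector of length numFeatures.
    static double[] foldIn(Map<Long, double[]> itemFactors,
                           Map<Long, Double> anonymousPrefs,
                           int numFeatures) {
        double[] user = new double[numFeatures];
        for (Map.Entry<Long, Double> pref : anonymousPrefs.entrySet()) {
            double[] v = itemFactors.get(pref.getKey());
            if (v == null) {
                continue; // item unknown to the model: skip it
            }
            for (int f = 0; f < numFeatures; f++) {
                user[f] += pref.getValue() * v[f];
            }
        }
        return user;
    }

    // Estimated preference is the dot product of user and item vectors.
    static double estimate(double[] user, double[] item) {
        double dot = 0.0;
        for (int f = 0; f < user.length; f++) {
            dot += user[f] * item[f];
        }
        return dot;
    }
}
```

A recommendToAnonymous() built on this would fold in the PreferenceArray once, score candidate items with estimate(), and return the top N.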
>>>>>
>>>>> Would such a method fit your needs?
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>
>>>>>
>>>>> 2013/7/17 Pat Ferrel <[email protected]>
>>>>>
>>>>>> May I ask how you plan to support model updates and 'anonymous' users?
>>>>>>
>>>>>> I assume the latent factor model is still calculated offline in batch
>>>>>> mode, and then there are periodic updates? How are the updates handled?
>>>>>> Do you plan to require batch model refactorization for any update? Or
>>>>>> perform some partial update, maybe by just transforming new data into
>>>>>> the LF space already in place, then doing a full refactorization every
>>>>>> so often in batch mode?
>>>>>>
>>>>>> By 'anonymous users' I mean users with some history that is not yet
>>>>>> incorporated in the LF model. This could be history from a new user
>>>>>> asked to pick a few items to start the rec process, or an old user with
>>>>>> some new action history not yet in the model. Are you going to allow
>>>>>> passing the entire history vector, or a userID plus incremental new
>>>>>> history, to the recommender? I hope so.
>>>>>>
>>>>>> For what it's worth, we did a comparison of Mahout item-based CF to
>>>>>> Mahout ALS-WR CF on 2.5M users and 500K items with many millions of
>>>>>> actions over 6 months of data. The data was purchase data from a
>>>>>> diverse ecom source with a large variety of products, from electronics
>>>>>> to clothes. We found item-based CF did far better than ALS. As we
>>>>>> increased the number of latent factors the results got better, but they
>>>>>> were never within 10% of item-based (we used MAP as the offline
>>>>>> metric). Not sure why, but maybe it has to do with the diversity of the
>>>>>> item types.
>>>>>>
>>>>>> I understand that a full item-based online recommender has very
>>>>>> different tradeoffs, and anyway others may not have seen this disparity
>>>>>> of results. Furthermore, we don't have A/B test results yet to validate
>>>>>> the offline metric.
>>>>>>
>>>>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <[email protected]> wrote:
>>>>>>
>>>>>> Peng,
>>>>>>
>>>>>> This is the reason I separated out the DataModel and only put the
>>>>>> learner stuff there. The learner I mentioned yesterday just stores the
>>>>>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors of them, and does
>>>>>> not care where preferences are stored.
>>>>>>
>>>>>> I, kind of, agree with the multi-level DataModel approach: one for
>>>>>> iterating over "all" preferences, and one for when you want to deploy
>>>>>> a recommender and perform a lot of top-N recommendation tasks.
>>>>>>
>>>>>> (Or one DataModel with a strategy that might reduce the existing memory
>>>>>> consumption while still providing fast access; I am not sure. Let me
>>>>>> try a matrix-backed DataModel approach.)
>>>>>>
>>>>>> Gokhan
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>>>>> representation; 12x more memory is a no-go. The techniques used in
>>>>>>> FactorizablePreferences allow a much more memory-efficient
>>>>>>> representation, tested on the KDD Music dataset, which is approx 2.5
>>>>>>> times Netflix and fits into 3GB with that approach.
>>>>>>>
>>>>>>>
>>>>>>> 2013/7/16 Ted Dunning <[email protected]>
>>>>>>>
>>>>>>>> Netflix is a small dataset. 12G for that seems quite excessive.
>>>>>>>>
>>>>>>>> Note also that this is before you have done any work.
>>>>>>>>
>>>>>>>> Ideally, 100million observations should take << 1GB.
>>>>>>>>
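
As a back-of-envelope check of that figure (the 8-bytes-per-observation layout below is an assumption, not any particular implementation):

```java
// Back-of-envelope estimate: each observation stored as a 4-byte item index
// plus a 4-byte float rating, with no object headers or per-user hash maps.
public class MemoryEstimate {
    static final long BYTES_PER_OBSERVATION = Integer.BYTES + Float.BYTES; // 4 + 4 = 8

    static long estimateBytes(long observations) {
        return observations * BYTES_PER_OBSERVATION;
    }
}
```

This naive layout already lands around 0.8 GB for 100 million observations; getting well under a gigabyte means packing ratings into fewer bytes or delta-encoding the item indices, which is presumably what a "smart representation" does.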
>>>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> The second idea is indeed splendid; we should separate
>>>>>>>>> time-complexity-first and space-complexity-first implementations.
>>>>>>>>> What I'm not quite sure of is whether we really need to create two
>>>>>>>>> interfaces instead of one. Personally, I think a 12G heap space is
>>>>>>>>> not that high, right? Most new laptops can already handle that
>>>>>>>>> (emphasis on laptop). And if we replace the hash map (the culprit of
>>>>>>>>> the high memory consumption) with a list/linked list, it would
>>>>>>>>> simply degrade the time complexity of a linear search to O(n), not
>>>>>>>>> too bad either. The current DataModel is the result of careful
>>>>>>>>> thought and has undergone extensive testing; it is easier to expand
>>>>>>>>> on top of it than to subvert it.
>>>>>>>>
>>>
>>>
>>
>>
>>
>
>