It was about 2.5M users and 500K items with 25M actions over 6 months of data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng <[email protected]> wrote:

If I remember right, a highlight of the 0.8 release is an online clustering 
algorithm. I'm not sure if it can be used in an item-based recommender, but this 
is definitely something I would like to pursue. It's probably the only advantage 
a non-hadoop implementation can offer in the future.

Many non-hadoop recommenders are pretty fast. But the existing in-memory 
GenericDataModel and FileDataModel are largely implemented for sandboxes; IMHO 
they are the culprit of the scalability problem.

May I ask about the scale of your dataset? How many ratings does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:
> Well, with itembased the only problem is new items. New users can
> immediately be served by the model (although this is not well supported by
> the API in Mahout). For the majority of usecases I saw, it is perfectly
> fine to have a short delay until new items "enter" the recommender, usually
> this happens after a retraining in batch. You have to care for cold-start
> and collect some interactions anyway.
> 
> 
> 2013/7/18 Pat Ferrel <[email protected]>
> 
>> Yes, what Myrrix does is good.
>> 
>> My last aside was a wish for an item-based online recommender not only
>> factorized. Ted talks about using Solr for this, which we're experimenting
>> with alongside Myrrix. I suspect Solr works but it does require a bit of
>> tinkering and doesn't have quite the same set of options--no llr similarity
>> for instance.
>> 
>> On the same subject I recently attended a workshop in Seattle for UAI2013
>> where Walmart reported similar results using a factorized recommender. They
>> had to increase the factor number past where it would perform well. Along
>> the way they saw increasing performance measuring precision offline. They
>> eventually gave up on a factorized solution. This decision seems odd but
>> anyway… In the case of Walmart and our data set they are quite diverse. The
>> best idea is probably to create different recommenders for separate parts
>> of the catalog but if you create one model on all items our intuition is
>> that item-based works better than factorized. Again caveat--no A/B tests to
>> support this yet.
>> 
>> Doing an online item-based recommender would quickly run into scaling
>> problems, no? We put together the simple Mahout in-memory version and it
>> could not really handle more than a down-sampled few months of our data.
>> Down-sampling lost us 20% of our precision scores so we moved to the hadoop
>> version. Now we have use-cases for an online recommender that handles
>> anonymous new users and that takes the story full circle.
>> 
>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <[email protected]> wrote:
>> 
>> Hi Pat
>> 
>> I think we should provide a simple support for recommending to anonymous
>> users. We should have a method recommendToAnonymous() that takes a
>> PreferenceArray as argument. For itembased recommenders, it's
>> straightforward to compute recommendations, for userbased you have to
>> search through all users once, for latent factor models, you have to fold
>> the user vector into the low dimensional space.
>> 
>> I think Sean already added this method in myrrix and I have some code for
>> my kornakapi project (a simple weblayer for mahout).
>> 
>> Would such a method fit your needs?
>> 
>> Best,
>> Sebastian
>> 
>> 
>> 
>> 2013/7/17 Pat Ferrel <[email protected]>
>> 
>>> May I ask how you plan to support model updates and 'anonymous' users?
>>> 
>>> I assume the latent factors model is calculated offline still in batch
>>> mode, then there are periodic updates? How are the updates handled? Do you
>>> plan to require batch model refactorization for any update? Or perform some
>>> partial update by maybe just transforming new data into the LF space
>>> already in place then doing full refactorization every so often in batch
>>> mode?
>>> 
>>> By 'anonymous users' I mean users with some history that is not yet
>>> incorporated in the LF model. This could be history from a new user asked
>>> to pick a few items to start the rec process, or an old user with some new
>>> action history not yet in the model. Are you going to allow for passing the
>>> entire history vector or userID+incremental new history to the recommender?
>>> I hope so.
>>> 
>>> For what it's worth we did a comparison of Mahout Item based CF to Mahout
>>> ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of
>>> data. The data was purchase data from a diverse ecom source with a large
>>> variety of products from electronics to clothes. We found Item based CF did
>>> far better than ALS. As we increased the number of latent factors the
>>> results got better but were never within 10% of item based (we used MAP as
>>> the offline metric). Not sure why but maybe it has to do with the diversity
>>> of the item types.
>>> 
>>> I understand that a full item based online recommender has very different
>>> tradeoffs and anyway others may not have seen this disparity of results.
>>> Furthermore we don't have A/B test results yet to validate the offline
>>> metric.
>>> 
>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <[email protected]> wrote:
>>> 
>>> Peng,
>>> 
>>> This is the reason I separated out the DataModel, and only put the learner
>>> stuff there. The learner I mentioned yesterday just stores the
>>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>>> where preferences are stored.
>>> 
>>> I, kind of, agree with the multi-level DataModel approach:
>>> One for iterating over "all" preferences, one for if one wants to deploy a
>>> recommender and perform a lot of top-N recommendation tasks.
>>> 
>>> (Or one DataModel with a strategy that might reduce existing memory
>>> consumption, while still providing fast access, I am not sure. Let me try a
>>> matrix-backed DataModel approach)
>>> 
>>> Gokhan
>>> 
>>> 
>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <[email protected]>
>>> wrote:
>>> 
>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>> representation, 12x more memory is a nogo. The techniques used in
>>>> FactorizablePreferences allow a much more memory efficient representation,
>>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits into
>>>> 3GB with that approach.
>>>> 
>>>> 
>>>> 2013/7/16 Ted Dunning <[email protected]>
>>>> 
>>>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>>>> 
>>>>> Note also that this is before you have done any work.
>>>>> 
>>>>> Ideally, 100million observations should take << 1GB.
>>>>> 
>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <[email protected]> wrote:
>>>>>> The second idea is indeed splendid, we should separate time-complexity
>>>>>> first and space-complexity first implementation. What I'm not quite sure
>>>>>> about is whether we really need to create two interfaces instead of one.
>>>>>> Personally, I think 12G heap space is not that high right? Most new
>>>>>> laptops can already handle that (emphasis on laptop). And if we replace
>>>>>> hash map (the culprit of high memory consumption) with list/linkedList, it
>>>>>> would simply degrade time complexity for a linear search to O(n), not too
>>>>>> bad either. The current DataModel is a result of careful thought and has
>>>>>> undergone extensive testing, it is easier to expand on top of it instead
>>>>>> of subverting it.
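
As a rough check on the memory figures discussed in this thread, here is a back-of-the-envelope sketch in Python. The function names, the 4-byte float, the 4-byte item index, the 1-byte rating, the 8-byte row offset, the 20 latent factors and the 480K user count are all assumptions for illustration. It does not describe how Mahout's DataModel or FactorizablePreferences actually lay out their data.

```python
def factor_model_bytes(n_users, n_items, n_factors, bytes_per_value=4):
    """Size of the learner parameters: (noOfUsers + noOfItems) * noOfLatentFactors values."""
    return (n_users + n_items) * n_factors * bytes_per_value


def csr_bytes(n_observations, n_users, index_bytes=4, value_bytes=1, offset_bytes=8):
    """Size of an assumed CSR-style layout: one item index and one value per
    observation, plus one row offset per user and a final sentinel offset."""
    per_observation = n_observations * (index_bytes + value_bytes)
    row_offsets = (n_users + 1) * offset_bytes
    return per_observation + row_offsets


if __name__ == "__main__":
    # 2.5M users and 500K items come from the thread; 20 factors is an assumption.
    print(factor_model_bytes(2_500_000, 500_000, 20) / 1e6, "MB")
    # 100M observations comes from the thread; 480K users is an assumed round figure.
    print(csr_bytes(100_000_000, 480_000) / 1e9, "GB")
```

Under these assumptions both figures stay below 1 GB, which is in line with the "less than one gigabyte in a smart representation" remark above.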
>>> 
>> 
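
The point made earlier in the thread, that item-based recommenders can serve anonymous users directly from a list of preferences, can be sketched as follows. This is a minimal Python illustration, not Mahout code: recommendToAnonymous() is only the method name proposed in the thread, and the function below, its signature, and the dict-of-dicts similarity structure are hypothetical. It assumes item-item similarities were already computed in batch.

```python
def recommend_to_anonymous(prefs, item_sims, top_n=3):
    """Score unseen items by a similarity-weighted average of the given preferences.

    prefs:     {item_id: preference_value} for the anonymous user (assumed input shape)
    item_sims: {item_id: {other_item_id: similarity}} precomputed offline (assumed)
    """
    scores = {}
    weights = {}
    for item, value in prefs.items():
        for other, sim in item_sims.get(item, {}).items():
            if other in prefs:
                continue  # never recommend something the user already interacted with
            scores[other] = scores.get(other, 0.0) + sim * value
            weights[other] = weights.get(other, 0.0) + abs(sim)
    ranked = sorted(
        ((scores[i] / weights[i], i) for i in scores if weights[i] > 0),
        reverse=True,
    )
    return [(item, score) for score, item in ranked[:top_n]]


if __name__ == "__main__":
    sims = {"a": {"b": 0.9, "c": 0.1}, "d": {"b": 0.2, "c": 0.8}}
    print(recommend_to_anonymous({"a": 5.0, "d": 1.0}, sims))
```

No user needs to exist in the model for this to work, which is why new users are cheap in the item-based case while new items still have to wait for the next batch similarity computation.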


