It is 2 SparseRowMatrices, Peng. But I don't want to comment on it before
actually trying it. This is essentially a first step for me to choose my
side on the DataModel implementation discussion:)

Gokhan
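
As a rough illustration of what a matrix-backed data model could look like, here is a minimal Java sketch. It assumes the two matrices are a user-by-item matrix and its item-by-user transpose, which the message above does not state. It uses plain sorted arrays rather than Mahout's SparseRowMatrix so that it runs without dependencies. The names TwoMatrixPrefs, SparseRow, setPreference, itemsOf and usersOf are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch only; this is not Mahout's DataModel or SparseRowMatrix API.
final class TwoMatrixPrefs {

    // One sparse row: sorted column indices with a parallel array of values.
    static final class SparseRow {
        int[] cols = new int[0];
        float[] vals = new float[0];

        // Inserts or overwrites one entry, keeping cols sorted (O(row length)).
        void set(int col, float val) {
            int pos = Arrays.binarySearch(cols, col);
            if (pos >= 0) {
                vals[pos] = val;
                return;
            }
            int ins = -pos - 1;
            int[] newCols = new int[cols.length + 1];
            float[] newVals = new float[vals.length + 1];
            System.arraycopy(cols, 0, newCols, 0, ins);
            System.arraycopy(vals, 0, newVals, 0, ins);
            newCols[ins] = col;
            newVals[ins] = val;
            System.arraycopy(cols, ins, newCols, ins + 1, cols.length - ins);
            System.arraycopy(vals, ins, newVals, ins + 1, vals.length - ins);
            cols = newCols;
            vals = newVals;
        }

        // Binary search lookup; returns NaN when the entry is absent.
        float get(int col) {
            int pos = Arrays.binarySearch(cols, col);
            return pos >= 0 ? vals[pos] : Float.NaN;
        }
    }

    private final SparseRow[] byUser; // row = user, column = item
    private final SparseRow[] byItem; // row = item, column = user

    TwoMatrixPrefs(int numUsers, int numItems) {
        byUser = new SparseRow[numUsers];
        byItem = new SparseRow[numItems];
        for (int u = 0; u < numUsers; u++) byUser[u] = new SparseRow();
        for (int i = 0; i < numItems; i++) byItem[i] = new SparseRow();
    }

    // Every preference is written twice, once per orientation.
    void setPreference(int user, int item, float value) {
        byUser[user].set(item, value);
        byItem[item].set(user, value);
    }

    float getPreference(int user, int item) {
        return byUser[user].get(item);
    }

    int[] itemsOf(int user) {
        return byUser[user].cols.clone();
    }

    int[] usersOf(int item) {
        return byItem[item].cols.clone();
    }
}
```

Storing both orientations doubles the storage per preference, but it makes both "items of a user" and "users of an item" a single row read, which is the trade-off such a layout would have to be measured against.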

On Fri, Jul 19, 2013 at 2:25 AM, Peng Cheng <[email protected]> wrote:

> Wow, that's lightning fast.
>
> Is it a SparseMatrix or DenseMatrix?
>
>
> On 13-07-18 07:23 PM, Gokhan Capan wrote:
>
>> I just started to implement a Matrix backed data model and pushed it, to
>> check the performance and memory considerations.
>>
>> I believe I can try it on some data tomorrow.
>>
>> Best
>>
>> Gokhan
>>
>>
>> On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng <[email protected]>
>> wrote:
>>
>>> I see, sorry I was too presumptuous. I have only recently worked with
>>> and tested SVDRecommender, so I could not have known how an item-based
>>> recommender performs. Maybe there is room for algorithmic optimization.
>>>
>>> The online recommender Gokhan is working on is also an SVDRecommender. An
>>> online user-based or item-based recommender based on clustering technique
>>> would definitely be critical, but we need an expert to volunteer :)
>>>
>>> Perhaps Dr Dunning can have a few words? He announced the online
>>> clustering component.
>>>
>>> Yours Peng
>>>
>>>
>>> On 13-07-18 03:54 PM, Pat Ferrel wrote:
>>>
>>>> No, it was CPU bound, not memory. I gave it something like 14G heap. It
>>>> was running, just too slow to be of any real use. We switched to the
>>>> hadoop version and stored precalculated recs in a db for every user.
>>>>
>>>> On Jul 18, 2013, at 12:06 PM, Peng Cheng <[email protected]> wrote:
>>>>
>>>> Strange, it's just a little bit larger than the libimseti dataset (17M
>>>> ratings). Did you encounter an outOfMemory or GCTimeOut exception?
>>>> Allocating more heap space usually helps.
>>>>
>>>> Yours Peng
>>>>
>>>> On 13-07-18 02:27 PM, Pat Ferrel wrote:
>>>>
>>>>> It was about 2.5M users and 500K items with 25M actions over 6 months
>>>>> of data.
>>>>>
>>>>> On Jul 18, 2013, at 10:15 AM, Peng Cheng <[email protected]> wrote:
>>>>>
>>>>> If I remember right, a highlight of the 0.8 release is an online
>>>>> clustering algorithm. I'm not sure if it can be used in an item-based
>>>>> recommender, but this is definitely something I would like to pursue.
>>>>> It's probably the only advantage a non-hadoop implementation can offer
>>>>> in the future.
>>>>>
>>>>> Many non-hadoop recommenders are pretty fast. But the existing in-memory
>>>>> GenericDataModel and FileDataModel are largely implemented for
>>>>> sandboxes; IMHO they are the culprit behind the scalability problem.
>>>>>
>>>>> May I ask about the scale of your dataset? How many ratings does it
>>>>> have?
>>>>>
>>>>> Yours Peng
>>>>>
>>>>> On 13-07-18 12:14 PM, Sebastian Schelter wrote:
>>>>>
>>>>>> Well, with itembased the only problem is new items. New users can
>>>>>> immediately be served by the model (although this is not well
>>>>>> supported by the API in Mahout). For the majority of usecases I saw,
>>>>>> it is perfectly fine to have a short delay until new items "enter" the
>>>>>> recommender, usually this happens after a retraining in batch. You
>>>>>> have to care for cold-start and collect some interactions anyway.
>>>>>>
>>>>>>
>>>>>> 2013/7/18 Pat Ferrel <[email protected]>
>>>>>>
>>>>>>> Yes, what Myrrix does is good.
>>>>>>>
>>>>>>> My last aside was a wish for an online recommender that is item-based,
>>>>>>> not only factorized. Ted talks about using Solr for this, which we're
>>>>>>> experimenting with alongside Myrrix. I suspect Solr works but it does
>>>>>>> require a bit of tinkering and doesn't have quite the same set of
>>>>>>> options--no llr similarity for instance.
>>>>>>>
>>>>>>> On the same subject I recently attended a workshop in Seattle for
>>>>>>> UAI2013 where Walmart reported similar results using a factorized
>>>>>>> recommender. They had to increase the factor number past where it
>>>>>>> would perform well. Along the way they saw increasing performance
>>>>>>> measuring precision offline. They eventually gave up on a factorized
>>>>>>> solution. This decision seems odd but anyway… In the case of Walmart
>>>>>>> and our data set they are quite diverse. The best idea is probably to
>>>>>>> create different recommenders for separate parts of the catalog but
>>>>>>> if you create one model on all items our intuition is that item-based
>>>>>>> works better than factorized. Again caveat--no A/B tests to support
>>>>>>> this yet.
>>>>>>>
>>>>>>> Doing an online item-based recommender would quickly run into scaling
>>>>>>> problems, no? We put together the simple Mahout in-memory version and
>>>>>>> it could not really handle more than a down-sampled few months of our
>>>>>>> data. Down-sampling lost us 20% of our precision scores so we moved to
>>>>>>> the hadoop version. Now we have use-cases for an online recommender
>>>>>>> that handles anonymous new users and that takes the story full circle.
>>>>>>>
>>>>>>> On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Pat
>>>>>>>
>>>>>>> I think we should provide simple support for recommending to
>>>>>>> anonymous users. We should have a method recommendToAnonymous() that
>>>>>>> takes a PreferenceArray as argument. For itembased recommenders, it's
>>>>>>> straightforward to compute recommendations; for userbased, you have to
>>>>>>> search through all users once; for latent factor models, you have to
>>>>>>> fold the user vector into the low dimensional space.
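
To make the item-based case above concrete, here is a minimal Java sketch of scoring items for a user who exists only as a list of (item, preference) pairs. It uses a common similarity-weighted average, not necessarily what Mahout, Myrrix or kornakapi implement. The dense `sim` array, the class name AnonymousItemBased and the method scoreForAnonymous are assumptions made for illustration; a real recommendToAnonymous() would take a PreferenceArray.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the Mahout, Myrrix or kornakapi implementation.
final class AnonymousItemBased {

    // sim[i][j] is a precomputed item-item similarity (dense here for brevity).
    // itemIds and prefs are parallel arrays describing the anonymous user's history.
    // Returns one score per item; NaN marks items already seen or not scorable.
    static double[] scoreForAnonymous(double[][] sim, int[] itemIds, double[] prefs) {
        int numItems = sim.length;
        Set<Integer> seen = new HashSet<>();
        for (int id : itemIds) {
            seen.add(id);
        }
        double[] scores = new double[numItems];
        for (int candidate = 0; candidate < numItems; candidate++) {
            if (seen.contains(candidate)) {
                scores[candidate] = Double.NaN; // do not recommend known items
                continue;
            }
            double weightedSum = 0.0;
            double totalWeight = 0.0;
            for (int k = 0; k < itemIds.length; k++) {
                double s = sim[candidate][itemIds[k]];
                weightedSum += s * prefs[k];
                totalWeight += Math.abs(s);
            }
            // Similarity-weighted average of the user's own preferences.
            scores[candidate] = totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
        }
        return scores;
    }
}
```

No stored user record is needed: the only model state touched is the item-item similarity, which is why the item-based case is the easy one of the three.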
>>>>>>>
>>>>>>> I think Sean already added this method in myrrix and I have some code
>>>>>>> for my kornakapi project (a simple weblayer for mahout).
>>>>>>>
>>>>>>> Would such a method fit your needs?
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2013/7/17 Pat Ferrel <[email protected]>
>>>>>>>
>>>>>>>> May I ask how you plan to support model updates and 'anonymous'
>>>>>>>> users?
>>>>>>>>
>>>>>>>> I assume the latent factors model is calculated offline still in
>>>>>>>> batch mode, then there are periodic updates? How are the updates
>>>>>>>> handled? Do you plan to require batch model refactorization for any
>>>>>>>> update? Or perform some partial update by maybe just transforming
>>>>>>>> new data into the LF space already in place then doing full
>>>>>>>> refactorization every so often in batch mode?
>>>>>>>>
>>>>>>>> By 'anonymous users' I mean users with some history that is not yet
>>>>>>>> incorporated in the LF model. This could be history from a new user
>>>>>>>> asked to pick a few items to start the rec process, or an old user
>>>>>>>> with some new action history not yet in the model. Are you going to
>>>>>>>> allow for passing the entire history vector or userID+incremental
>>>>>>>> new history to the recommender? I hope so.
>>>>>>>>
>>>>>>>> For what it's worth we did a comparison of Mahout Item based CF to
>>>>>>>> Mahout ALS-WR CF on 2.5M users and 500K items with many M actions
>>>>>>>> over 6 months of data. The data was purchase data from a diverse
>>>>>>>> ecom source with a large variety of products from electronics to
>>>>>>>> clothes. We found Item based CF did far better than ALS. As we
>>>>>>>> increased the number of latent factors the results got better but
>>>>>>>> were never within 10% of item based (we used MAP as the offline
>>>>>>>> metric). Not sure why but maybe it has to do with the diversity of
>>>>>>>> the item types.
>>>>>>>>
>>>>>>>> I understand that a full item based online recommender has very
>>>>>>>> different tradeoffs and anyway others may not have seen this
>>>>>>>> disparity of results. Furthermore we don't have A/B test results yet
>>>>>>>> to validate the offline metric.
>>>>>>>> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Peng,
>>>>>>>>
>>>>>>>> This is the reason I separated out the DataModel, and only put the
>>>>>>>> learner stuff there. The learner I mentioned yesterday just stores
>>>>>>>> the parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does
>>>>>>>> not care where preferences are stored.
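
As a back-of-the-envelope check of that parameter count, here is a small Java sketch. The user and item counts in the usage note are taken from the dataset Pat describes elsewhere in this thread. The factor count of 50 and the 8 bytes per parameter (a Java double) are assumptions chosen only for illustration, and the class and method names are hypothetical.

```java
// Hypothetical helper for estimating the size of a latent factor model.
final class FactorMemory {

    // Number of stored parameters: one vector of factors per user and per item.
    static long parameterCount(long noOfUsers, long noOfItems, long noOfLatentFactors) {
        return (noOfUsers + noOfItems) * noOfLatentFactors;
    }

    // Raw payload size, ignoring object headers and any container overhead.
    static long payloadBytes(long parameterCount, int bytesPerParameter) {
        return parameterCount * bytesPerParameter;
    }
}
```

For 2.5M users, 500K items and an assumed 50 factors, `parameterCount(2_500_000L, 500_000L, 50L)` is 150,000,000, and at 8 bytes each the raw payload is 1,200,000,000 bytes, independent of how many preferences were observed.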
>>>>>>>>
>>>>>>>> I, kind of, agree with the multi-level DataModel approach:
>>>>>>>> One for iterating over "all" preferences, one for if one wants to
>>>>>>>> deploy a recommender and perform a lot of top-N recommendation tasks.
>>>>>>>>
>>>>>>>> (Or one DataModel with a strategy that might reduce existing memory
>>>>>>>> consumption, while still providing fast access, I am not sure. Let
>>>>>>>> me try a matrix-backed DataModel approach)
>>>>>>>>
>>>>>>>> Gokhan
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>>>>>>> representation, 12x more memory is a nogo. The techniques used in
>>>>>>>>> FactorizablePreferences allow a much more memory efficient
>>>>>>>>> representation, tested on KDD Music dataset which is approx 2.5
>>>>>>>>> times Netflix and fits into 3GB with that approach.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2013/7/16 Ted Dunning <[email protected]>
>>>>>>>>>
>>>>>>>>>   Netflix is a small dataset.  12G for that seems quite excessive.
>>>>>>>>>
>>>>>>>>>> Note also that this is before you have done any work.
>>>>>>>>>>
>>>>>>>>>> Ideally, 100million observations should take << 1GB.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>>> The second idea is indeed splendid, we should separate
>>>>>>>>>>> time-complexity-first and space-complexity-first implementations.
>>>>>>>>>>> What I'm not quite sure of is whether we really need to create
>>>>>>>>>>> two interfaces instead of one.
>>>>>>>>>>>
>>>>>>>>>>> Personally, I think 12G heap space is not that high, right? Most
>>>>>>>>>>> new laptops can already handle that (emphasis on laptop). And if
>>>>>>>>>>> we replace the hash map (the culprit of high memory consumption)
>>>>>>>>>>> with list/linkedList, it would simply degrade the time complexity
>>>>>>>>>>> of a lookup to a linear search, O(n), which is not too bad either.
>>>>>>>>>>> The current DataModel is the result of careful thought and has
>>>>>>>>>>> undergone extensive testing; it is easier to expand on top of it
>>>>>>>>>>> instead of subverting it.
>>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>
>>>>
>>>
>
>
