Hi Pat, please see my response inline.

Best,
Gokhan


On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> May I ask how you plan to support model updates and 'anonymous' users?
>
> I assume the latent factor model is still calculated offline in batch
> mode, with periodic updates? How are the updates handled?


If you are referring to the recommender under discussion here, no: the model
can be updated with a single preference, using stochastic gradient descent,
by updating that particular user's and item's factors simultaneously.
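
To make that concrete, here is a rough sketch of what such a
single-preference SGD update could look like. The class and parameter names
are made up for illustration and are not the actual Mahout API; the point is
that the learner only keeps the (noOfUsers+noOfItems)*noOfLatentFactors
parameters and adjusts one user row and one item row per observed preference:

import java.util.Random;

/**
 * Illustrative sketch (not the actual Mahout API) of a latent factor model
 * that can be updated online, one preference at a time, via SGD.
 */
public class OnlineFactorizationLearner {

  private final double[][] userFactors; // noOfUsers x noOfLatentFactors
  private final double[][] itemFactors; // noOfItems x noOfLatentFactors
  private final double learningRate;
  private final double regularization;

  public OnlineFactorizationLearner(int noOfUsers, int noOfItems,
      int noOfLatentFactors, double learningRate, double regularization,
      long seed) {
    this.userFactors = randomMatrix(noOfUsers, noOfLatentFactors, seed);
    this.itemFactors = randomMatrix(noOfItems, noOfLatentFactors, seed + 1);
    this.learningRate = learningRate;
    this.regularization = regularization;
  }

  /** Updates the model with one observed (user, item, rating) preference. */
  public void train(int userIndex, int itemIndex, double rating) {
    double[] u = userFactors[userIndex];
    double[] v = itemFactors[itemIndex];

    // prediction error for this single preference
    double error = rating - dot(u, v);

    // simultaneous SGD step on the affected user and item factors only
    for (int k = 0; k < u.length; k++) {
      double uk = u[k];
      double vk = v[k];
      u[k] += learningRate * (error * vk - regularization * uk);
      v[k] += learningRate * (error * uk - regularization * vk);
    }
  }

  /** Predicted preference of a user for an item. */
  public double estimatePreference(int userIndex, int itemIndex) {
    return dot(userFactors[userIndex], itemFactors[itemIndex]);
  }

  private static double[][] randomMatrix(int rows, int cols, long seed) {
    Random random = new Random(seed);
    double[][] m = new double[rows][cols];
    for (int i = 0; i < rows; i++) {
      for (int j = 0; j < cols; j++) {
        // small random init so the gradient steps have something to move
        m[i][j] = 0.01 * random.nextGaussian();
      }
    }
    return m;
  }

  private static double dot(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}

So a batch refactorization is not required for every update; each new
preference is folded in as it arrives.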

Do you plan to require batch model refactorization for any update? Or
> perform some partial update by maybe just transforming new data into the LF
> space already in place then doing full refactorization every so often in
> batch mode?
>
> By 'anonymous users' I mean users with some history that is not yet
> incorporated in the LF model. This could be history from a new user asked
> to pick a few items to start the rec process, or an old user with some new
> action history not yet in the model. Are you going to allow for passing the
> entire history vector or userID+incremental new history to the recommender?
> I hope so.


> For what it's worth, we did a comparison of Mahout item-based CF to Mahout
> ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6
> months of data. The data was purchase data from a diverse ecom source with a
> large variety of products, from electronics to clothes. We found item-based
> CF did far better than ALS. As we increased the number of latent factors the
> results got better but were never within 10% of item-based (we used MAP as
> the offline metric). Not sure why, but maybe it has to do with the diversity
> of the item types.
>

My first question: are those actions only positive, like the "purchase"
actions you mentioned?


> I understand that a full item-based online recommender has very different
> tradeoffs, and in any case others may not have seen this disparity in
> results. Furthermore, we don't have A/B test results yet to validate the
> offline metric.


I personally think an A/B test is the best way to evaluate a recommender,
and if you are able to share the results, I look forward to seeing them. I
believe they would be a great contribution to future decisions.


> On Jul 16, 2013, at 2:41 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>
> Peng,
>
> This is the reason I separated out the DataModel and only put the learner
> stuff there. The learner I mentioned yesterday just stores the parameters,
> (noOfUsers+noOfItems)*noOfLatentFactors of them, and does not care where
> the preferences are stored.
>
> I kind of agree with the multi-level DataModel approach: one for iterating
> over "all" preferences, and one for when you want to deploy a recommender
> and perform a lot of top-N recommendation tasks.
>
> (Or one DataModel with a strategy that might reduce the existing memory
> consumption while still providing fast access; I am not sure. Let me try a
> matrix-backed DataModel approach.)
>
> Gokhan
>
>
> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <s...@apache.org>
> wrote:
>
> > I completely agree, Netflix is less than one gigabyte in a smart
> > representation, so 12x more memory is a no-go. The techniques used in
> > FactorizablePreferences allow a much more memory-efficient
> > representation, tested on the KDD Music dataset, which is approx 2.5
> > times Netflix and fits into 3GB with that approach.
> >
> >
> > 2013/7/16 Ted Dunning <ted.dunn...@gmail.com>
> >
> >> Netflix is a small dataset.  12G for that seems quite excessive.
> >>
> >> Note also that this is before you have done any work.
> >>
> >> Ideally, 100 million observations should take << 1GB.
> >>
> >> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
> >> wrote:
> >>
> >>> The second idea is indeed splendid; we should separate the
> >>> time-complexity-first and space-complexity-first implementations. What
> >>> I'm not quite sure about is whether we really need to create two
> >>> interfaces instead of one. Personally, I think 12G of heap space is not
> >>> that high, right? Most new laptops can already handle that (emphasis on
> >>> laptop). And if we replace the hash map (the culprit of the high memory
> >>> consumption) with a list/linked list, it would simply degrade lookups to
> >>> an O(n) linear search, which is not too bad either. The current
> >>> DataModel is the result of careful thought and has undergone extensive
> >>> testing; it is easier to expand on top of it than to subvert it.
> >>
> >
>
>
