May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factor (LF) model is still computed offline in batch
mode, with periodic updates. How are the updates handled? Do you plan to
require a full batch refactorization for every update? Or would you perform
a partial update, perhaps by just projecting new data into the LF space
already in place, and then do a full refactorization every so often in
batch mode?
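
To make the partial-update option concrete, here is a rough sketch of what I
mean by projecting new data into the existing LF space: keep the item factors
fixed and solve a small regularized least-squares problem for one user's
vector. Everything here (the class name FoldIn, the method names, plain
unweighted lambda) is my own assumption for illustration, not anything in
Mahout; ALS-WR, for instance, weights lambda by the number of ratings.

```java
// Hypothetical sketch: fold one user's ratings into fixed item factors.
final class FoldIn {

    /**
     * Solves (Vs^T Vs + lambda * I) u = Vs^T r for the user vector u, where
     * Vs holds the factor rows of the items the user has rated.
     */
    static double[] foldIn(double[][] itemFactors, int[] itemIds,
                           double[] ratings, double lambda) {
        if (lambda <= 0.0) {
            // lambda > 0 keeps the system positive definite and solvable
            throw new IllegalArgumentException("lambda must be > 0");
        }
        int k = itemFactors[0].length;
        double[][] a = new double[k][k + 1]; // augmented matrix [A | b]
        for (int n = 0; n < itemIds.length; n++) {
            double[] v = itemFactors[itemIds[n]];
            for (int i = 0; i < k; i++) {
                for (int j = 0; j < k; j++) {
                    a[i][j] += v[i] * v[j];
                }
                a[i][k] += v[i] * ratings[n];
            }
        }
        for (int i = 0; i < k; i++) {
            a[i][i] += lambda;
        }
        // Gaussian elimination with partial pivoting
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int row = col + 1; row < k; row++) {
                if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) {
                    pivot = row;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int row = col + 1; row < k; row++) {
                double f = a[row][col] / a[col][col];
                for (int j = col; j <= k; j++) {
                    a[row][j] -= f * a[col][j];
                }
            }
        }
        // Back substitution
        double[] u = new double[k];
        for (int i = k - 1; i >= 0; i--) {
            double s = a[i][k];
            for (int j = i + 1; j < k; j++) {
                s -= a[i][j] * u[j];
            }
            u[i] = s / a[i][i];
        }
        return u;
    }

    /** Predicted preference is the dot product of user and item vectors. */
    static double predict(double[] userVector, double[] itemVector) {
        double dot = 0.0;
        for (int f = 0; f < userVector.length; f++) {
            dot += userVector[f] * itemVector[f];
        }
        return dot;
    }
}
```

The cost per user is one k-by-k solve, so this could run at request time and
leave the item factors to the periodic batch job.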

By 'anonymous users' I mean users with some history that is not yet 
incorporated in the LF model. This could be history from a new user asked to 
pick a few items to start the rec process, or an old user with some new action 
history not yet in the model. Are you going to allow for passing the entire 
history vector or userID+incremental new history to the recommender? I hope so.
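
Roughly what I am hoping for, as a sketch only: two entry points, one taking
the whole history and one taking a userID plus whatever is new. The interface
AnonymousRecommender and the helper HistoryMerge are names I made up for this
mail; they are not an existing Mahout API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: the two call styles asked about above.
interface AnonymousRecommender {
    /** Recommend from a complete history (itemID -> preference). */
    List<Long> recommend(Map<Long, Float> fullHistory, int howMany);

    /** Recommend for a known user plus actions not yet in the model. */
    List<Long> recommend(long userID, Map<Long, Float> incrementalHistory, int howMany);
}

// Hypothetical helper: combine stored and new history before scoring.
final class HistoryMerge {
    /** Returns a new map; entries in incremental override stored ones. */
    static Map<Long, Float> merge(Map<Long, Float> stored,
                                  Map<Long, Float> incremental) {
        Map<Long, Float> merged = new HashMap<>(stored);
        merged.putAll(incremental);
        return merged;
    }
}
```

With something like merge() behind it, the second entry point reduces to the
first, so an implementation would only need one scoring path.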

For what it's worth, we did a comparison of Mahout item-based CF to Mahout
ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6
months of data. The data was purchase data from a diverse e-commerce source
with a large variety of products, from electronics to clothes. We found
item-based CF did far better than ALS. As we increased the number of latent
factors the results got better, but they were never within 10% of item-based
(we used MAP as the offline
metric). Not sure why but maybe it has to do with the diversity of the item 

I understand that a full item based online recommender has very different 
tradeoffs and anyway others may not have seen this disparity of results. 
Furthermore we don't have A/B test results yet to validate the offline metric.
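
In case it helps anyone compare numbers, this is roughly how MAP can be
computed. It is a minimal sketch, not our actual evaluation code; the class
name and the choice to normalize by the number of held-out relevant items
(rather than min(N, relevant), a common variant for top-N lists) are
assumptions.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of mean average precision (MAP) for top-N lists.
final class MeanAveragePrecision {

    /** Average precision of one ranked list against the held-out items. */
    static double averagePrecision(List<Long> ranked, Set<Long> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1); // precision at this rank
            }
        }
        return sum / relevant.size();
    }

    /** Mean of the per-user average precision values. */
    static double meanAveragePrecision(List<List<Long>> rankedPerUser,
                                       List<Set<Long>> relevantPerUser) {
        if (rankedPerUser.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (int u = 0; u < rankedPerUser.size(); u++) {
            total += averagePrecision(rankedPerUser.get(u), relevantPerUser.get(u));
        }
        return total / rankedPerUser.size();
    }
}
```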

This is the reason I separated out the DataModel and only put the learner
stuff there. The learner I mentioned yesterday just stores the
parameters, (noOfUsers+noOfItems)*noOfLatentFactors values in total, and
does not care where preferences are stored.
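
As a rough illustration of that layout (a sketch, not the actual learner; the
class name FactorStore and its methods are made up here), the parameters can
sit in one flat float array, user rows first and item rows after:

```java
// Hypothetical sketch: (noOfUsers + noOfItems) * noOfLatentFactors floats
// in a single array, independent of where preferences are stored.
final class FactorStore {
    private final int numUsers;
    private final int numFactors;
    private final float[] params; // user rows first, then item rows

    FactorStore(int numUsers, int numItems, int numFactors) {
        this.numUsers = numUsers;
        this.numFactors = numFactors;
        // addExact/multiplyExact fail loudly instead of overflowing
        int rows = Math.addExact(numUsers, numItems);
        this.params = new float[Math.multiplyExact(rows, numFactors)];
    }

    void setUserFactor(int user, int f, float value) {
        params[user * numFactors + f] = value;
    }

    void setItemFactor(int item, int f, float value) {
        params[(numUsers + item) * numFactors + f] = value;
    }

    /** Predicted preference: dot product of the user and item rows. */
    float predict(int user, int item) {
        int u = user * numFactors;
        int i = (numUsers + item) * numFactors;
        float dot = 0f;
        for (int f = 0; f < numFactors; f++) {
            dot += params[u + f] * params[i + f];
        }
        return dot;
    }

    /** Memory for the parameters alone, at 4 bytes per float. */
    static long estimateBytes(long numUsers, long numItems, int numFactors) {
        return 4L * (numUsers + numItems) * numFactors;
    }
}
```

For the 2.5M users and 500K items mentioned earlier in the thread, and
assuming 20 factors (my number, purely for illustration), estimateBytes gives
240,000,000 bytes, i.e. about 240 MB for the parameters.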

I kind of agree with the multi-level DataModel approach:
one for iterating over "all" preferences, and one for when one wants to
deploy a recommender and perform a lot of top-N recommendation tasks.

(Or one DataModel with a strategy that might reduce existing memory
consumption while still providing fast access; I am not sure. Let me try a
matrix-backed DataModel approach.)
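
One possible shape for the matrix-backed idea, only as a sketch under my own
assumptions (dense int indices for users and items, float values, and the
made-up class name CsrPreferences): a compressed sparse row layout with item
ids sorted within each user's row, so a lookup is a binary search rather than
a hash probe or a linear scan.

```java
import java.util.Arrays;

// Hypothetical sketch of a CSR-style preference store.
final class CsrPreferences {
    private final int[] rowOffsets; // length numUsers + 1
    private final int[] itemIds;    // sorted ascending within each row
    private final float[] values;   // parallel to itemIds

    CsrPreferences(int[] rowOffsets, int[] itemIds, float[] values) {
        this.rowOffsets = rowOffsets;
        this.itemIds = itemIds;
        this.values = values;
    }

    /** Returns the preference, or Float.NaN if the user has none. */
    float get(int user, int item) {
        int idx = Arrays.binarySearch(
                itemIds, rowOffsets[user], rowOffsets[user + 1], item);
        return idx >= 0 ? values[idx] : Float.NaN;
    }

    /** Number of preferences stored for one user. */
    int numPreferences(int user) {
        return rowOffsets[user + 1] - rowOffsets[user];
    }

    /** Array payload only: 4-byte offsets, 4-byte ids, 4-byte values. */
    static long estimateBytes(long numUsers, long numPrefs) {
        return 4L * (numUsers + 1) + 8L * numPrefs;
    }
}
```

At 8 bytes per observation, 100 million observations come to roughly 800 MB
before any further packing; storing ratings in a byte instead of a float
would bring that down further.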


> I completely agree, Netflix is less than one gigabyte in a smart
> representation, 12x more memory is a no-go. The techniques used in
> FactorizablePreferences allow a much more memory efficient representation,
> tested on KDD Music dataset which is approx 2.5 times Netflix and fits into
> 3GB with that approach.
>> Netflix is a small dataset.  12G for that seems quite excessive.
>> Note also that this is before you have done any work.
>> Ideally, 100million observations should take << 1GB.
>>> The second idea is indeed splendid; we should separate
>>> time-complexity-first and space-complexity-first implementations.
>>> What I'm not quite sure about is whether we really need to create two
>>> interfaces instead of one. Personally, I think 12G heap space is not
>>> that high, right? Most new laptops can already handle that (emphasis
>>> on laptop). And if we replace the hash map (the culprit of high memory
>>> consumption) with list/linkedList, it would simply degrade time
>>> complexity to O(n) for a linear search, not too bad either. The
>>> current DataModel is the result of careful thought and has undergone
>>> extensive testing; it is easier to expand on top of it instead of
>>> subverting it.

