Yeah, setPreference() and removePreference() shouldn't be there, but injecting the Recommender back into the DataModel creates a rather strong dependency, which may intermingle components with different concerns. Maybe we can do something with the RefreshHelper class? E.g. push something into a swap field so that the downstream of a refreshable chain can read it out. I have read Gokhan's UpdateAwareDataModel, and feel that it's probably too heavyweight for a model selector, as every time he changes the algorithm he has to re-register it.
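
Roughly what I have in mind, as a sketch (SwapField and SwapAwareComponent are made-up names just for illustration; only Refreshable is the real Mahout interface):

import java.util.Collection;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.mahout.cf.taste.common.Refreshable;

/**
 * Hypothetical holder that an upstream member of a refreshable chain
 * writes into during refresh() and that downstream members read out.
 */
public final class SwapField<T> {
  private final AtomicReference<T> value = new AtomicReference<T>();

  public void push(T newValue) {
    value.set(newValue);
  }

  public T read() {
    return value.get();
  }
}

/** Downstream consumer that picks the swapped value up on refresh. */
class SwapAwareComponent implements Refreshable {
  private final SwapField<long[]> changedUserIDs;

  SwapAwareComponent(SwapField<long[]> changedUserIDs) {
    this.changedUserIDs = changedUserIDs;
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    long[] changed = changedUserIDs.read();
    if (changed != null) {
      // update only the parts of the model touched by these users
    }
  }
}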

The second idea is indeed splendid: we should separate the time-complexity-first and space-complexity-first implementations. What I'm not quite sure about is whether we really need to create two interfaces instead of one. Personally, I think 12G of heap space is not that high, right? Most new laptops can already handle that (emphasis on laptop). And if we replaced the hash map (the culprit of the high memory consumption) with a list/LinkedList, it would simply degrade lookups to an O(n) linear search, which is not too bad either. The current DataModel is the result of careful thought and has undergone extensive testing; it is easier to expand on top of it than to subvert it.
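
To make the space/time tradeoff concrete, a sketch of what flat storage could look like (all names here are hypothetical, this is just to illustrate the O(n) lookup):

/**
 * Sketch of a compact preference store: parallel arrays instead of a
 * hash map. Memory drops to roughly 12 bytes per rating, but point
 * lookups become an O(n) linear scan.
 */
public final class CompactPreferences {
  private final long[] userIDs;
  private final long[] itemIDs;
  private final float[] values;

  public CompactPreferences(long[] userIDs, long[] itemIDs, float[] values) {
    this.userIDs = userIDs;
    this.itemIDs = itemIDs;
    this.values = values;
  }

  /** O(n) linear scan replacing the O(1) hash lookup. */
  public Float getPreferenceValue(long userID, long itemID) {
    for (int i = 0; i < userIDs.length; i++) {
      if (userIDs[i] == userID && itemIDs[i] == itemID) {
        return values[i];
      }
    }
    return null;
  }
}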

All the best,
Yours Peng

On 13-07-16 01:05 AM, Sebastian Schelter wrote:
Hi Gokhan,

I like your proposals and I think this is an important discussion. Peng
is also interested in working on online recommenders, so we should try
to team up our efforts. I'd like to extend the discussion a little to
related API changes that I think are necessary.

What do you think about completely removing the setPreference() and
removePreference() methods from Recommender? I think they don't belong
there for two reasons: First, they duplicate functionality from
DataModel and second, a lot of recommenders are read-only/train-once and
cannot handle single preference updates anyway.

I think we should have a DataModel implementation that can be updated,
and an online learning recommender should be able to register to be
notified of updates.
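
A minimal sketch of what that registration could look like (the listener and interface names below are made up for illustration, nothing final):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;

/** Hypothetical callback for recommenders that learn online. */
interface PreferenceUpdateListener {
  void onSetPreference(long userID, long itemID, float value) throws TasteException;
  void onRemovePreference(long userID, long itemID) throws TasteException;
}

/** Hypothetical updatable DataModel that notifies registered listeners. */
interface UpdatableDataModel extends DataModel {
  void addUpdateListener(PreferenceUpdateListener listener);
  void removeUpdateListener(PreferenceUpdateListener listener);
}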

We should furthermore split up the DataModel interface into a hierarchy
of three parts:

First, a simple read-only interface that allows sequential access to the
data (similar to FactorizablePreferences). This allows us to create
memory-efficient implementations. E.g. Cheng reported in MAHOUT-1272
that the current DataModel needs 12GB of heap for the Netflix dataset
(100M ratings), which is unacceptable. I was able to fit the KDD Music
dataset (250M ratings) into 3GB with FactorizablePreferences.

The second interface would extend the read-only interface and should
resemble what DataModel is today: an easy-to-use in-memory
implementation that trades high memory consumption for convenient
random access.

And finally the third interface would extend the second and provide
tooling for online updates of the data.
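
Just to illustrate, the hierarchy could look something like this (all names are placeholders except Preference itself, nothing final):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.Preference;

/** 1) Read-only, sequential access; enables compact storage. */
interface SequentialPreferences {
  int getNumUsers() throws TasteException;
  long getNumPreferences() throws TasteException;
  /** one pass over all ratings */
  Iterable<Preference> getPreferences() throws TasteException;
}

/** 2) Adds the convenient random access the current DataModel offers. */
interface RandomAccessPreferences extends SequentialPreferences {
  Float getPreferenceValue(long userID, long itemID) throws TasteException;
}

/** 3) Adds tooling for online updates on top of random access. */
interface MutablePreferences extends RandomAccessPreferences {
  void setPreference(long userID, long itemID, float value) throws TasteException;
  void removePreference(long userID, long itemID) throws TasteException;
}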

What do you think of that? Does it sound reasonable?

--sebastian


The DataModel I imagine would follow the current API, where the
underlying preference storage is replaced with a matrix.

A Recommender would then use the DataModel and the OnlineLearner, where
Recommender#setPreference is delegated to DataModel#setPreference (like it
does now), and DataModel#setPreference triggers OnlineLearner#train.
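
Roughly like this (a sketch only; OnlineLearner and the class name are placeholders, and a real implementation would map long IDs to matrix indices):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.math.SparseRowMatrix;

/** Hypothetical incremental learner interface. */
interface OnlineLearner {
  void train(long userID, long itemID, float value) throws TasteException;
}

/** Matrix-backed model that forwards each update to the learner. */
class MatrixBackedDataModel /* implements DataModel */ {
  private final SparseRowMatrix preferences;
  private final OnlineLearner learner;

  MatrixBackedDataModel(SparseRowMatrix preferences, OnlineLearner learner) {
    this.preferences = preferences;
    this.learner = learner;
  }

  public void setPreference(long userID, long itemID, float value) throws TasteException {
    // store the rating (IDs cast directly here for brevity)
    preferences.setQuick((int) userID, (int) itemID, value);
    // trigger online training on the single new data point
    learner.train(userID, itemID, value);
  }
}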
