Hi Peng, I never wanted to discard the old interface, I just wanted to split it up. I want to have a simple interface that only supports sequential access (and allows for very memory-efficient implementations, e.g. by the use of primitive arrays). DataModel should *extend* this interface and provide sequential and random access (basically what it already does).
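To make this concrete, here is a rough sketch of the split I have in mind (the names and shapes here are just illustrative, not a proposal for the actual API): a base interface that only promises sequential iteration, a random-access sub-interface on top, and an array-backed implementation of the sequential part.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class SplitSketch {

  /** Sequential access only -- enough for an SGD-style learner. */
  interface SequentialPreferences {
    int numPreferences();
    Iterator<float[]> iterator(); // each element is {userID, itemID, value}
  }

  /** Random access extends sequential access (roughly what DataModel offers today). */
  interface RandomAccessPreferences extends SequentialPreferences {
    float getPreferenceValue(int userID, int itemID);
  }

  /** Memory-efficient sequential implementation over primitive arrays. */
  static class ArrayPreferences implements SequentialPreferences {
    final int[] users;
    final int[] items;
    final float[] values;

    ArrayPreferences(int[] users, int[] items, float[] values) {
      this.users = users;
      this.items = items;
      this.values = values;
    }

    public int numPreferences() {
      return values.length;
    }

    public Iterator<float[]> iterator() {
      return new Iterator<float[]>() {
        int i = 0;
        public boolean hasNext() { return i < values.length; }
        public float[] next() {
          if (!hasNext()) throw new NoSuchElementException();
          float[] p = {users[i], items[i], values[i]};
          i++;
          return p;
        }
      };
    }
  }

  public static void main(String[] args) {
    ArrayPreferences prefs = new ArrayPreferences(
        new int[] {1, 1, 2}, new int[] {10, 20, 10}, new float[] {4f, 3f, 5f});
    float sum = 0f;
    for (Iterator<float[]> it = prefs.iterator(); it.hasNext(); ) {
      sum += it.next()[2];
    }
    assert prefs.numPreferences() == 3;
    assert sum == 12f;
  }
}
```

An SGD learner would only declare a dependency on the sequential interface, so you could hand it either a DataModel-backed adapter or the flat array variant.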
Then a recommender such as SGD could state that it only needs sequential access to the preferences, and you could either feed it a DataModel (so we don't break backwards compatibility) or a memory-efficient sequential-access thingy. Does that make sense to you?

2013/7/17 Peng Cheng <[email protected]>

> I see, OK, so we shouldn't use the old implementation. But I mean, the old
> interface doesn't have to be discarded. The discrepancy between your
> FactorizablePreferences and DataModel is that your model supports
> getPreferences(), which returns all preferences as an iterator, while
> DataModel supports a few old functions that return preferences for an
> individual user or item.
>
> My point is that it is not hard for each of them to implement what the
> other lacks: the old DataModel can implement getPreferences() just by a
> loop in an abstract class. Your new FactorizablePreferences can implement
> those old functions by a binary search that takes O(log n) time, or an
> interpolation search that takes O(log log n) time on average. The same goes
> for the online update. It will just be a matter of different speed and
> space, not a different interface standard; we can use the old unit tests,
> old examples, old everything. And we will be more flexible in writing
> ensemble recommenders.
>
> Just a few thoughts, I'll have to validate the idea first before creating
> a new JIRA ticket.
>
> Yours, Peng
>
> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>
>> I completely agree, Netflix is less than one gigabyte in a smart
>> representation; 12x more memory is a no-go. The techniques used in
>> FactorizablePreferences allow a much more memory-efficient representation,
>> tested on the KDD Music dataset, which is approx. 2.5 times Netflix and
>> fits into 3GB with that approach.
>>
>> 2013/7/16 Ted Dunning <[email protected]>
>>
>>> Netflix is a small dataset. 12G for that seems quite excessive.
>>>
>>> Note also that this is before you have done any work.
>>>
>>> Ideally, 100 million observations should take << 1GB.
>>>
>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <[email protected]>
>>> wrote:
>>>
>>>> The second idea is indeed splendid; we should separate
>>>> time-complexity-first and space-complexity-first implementations. What
>>>> I'm not quite sure about is whether we really need to create two
>>>> interfaces instead of one. Personally, I think 12G of heap space is not
>>>> that high, right? Most new laptops can already handle that (emphasis on
>>>> laptop). And if we replace the hash map (the culprit of the high memory
>>>> consumption) with a list/linked list, it would simply degrade lookups to
>>>> a linear search's O(n), not too bad either. The current DataModel is a
>>>> result of careful thought and has undergone extensive testing; it is
>>>> easier to expand on top of it than to subvert it.
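By the way, the O(log n) lookup you describe over a flat, sorted representation could look something like this (a hypothetical sketch with made-up names, not actual Mahout code) -- preferences kept in parallel primitive arrays sorted by (userID, itemID), with random access recovered via binary search:

```java
public class BinarySearchLookup {

  /**
   * Finds the index of (userID, itemID) in parallel arrays sorted by
   * (userID, itemID), or -1 if absent. O(log n) per lookup, so the old
   * per-user/per-item DataModel methods stay implementable on top of a
   * sequential-first layout.
   */
  static int find(int[] users, int[] items, int userID, int itemID) {
    int lo = 0;
    int hi = users.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1; // overflow-safe midpoint
      // lexicographic comparison of (users[mid], items[mid]) vs (userID, itemID)
      int cmp = users[mid] != userID
          ? Integer.compare(users[mid], userID)
          : Integer.compare(items[mid], itemID);
      if (cmp < 0) {
        lo = mid + 1;
      } else if (cmp > 0) {
        hi = mid - 1;
      } else {
        return mid;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    // sorted by (user, item): (1,10) (1,20) (2,10) (2,30) (3,20)
    int[] users = {1, 1, 2, 2, 3};
    int[] items = {10, 20, 10, 30, 20};
    float[] values = {4f, 3f, 5f, 2f, 1f};
    int idx = find(users, items, 2, 30);
    assert idx == 3;
    assert values[idx] == 2f;
    assert find(users, items, 3, 99) == -1; // absent pair
  }
}
```

An interpolation search would drop the average cost further when the IDs are roughly uniformly distributed, at the price of a slightly trickier implementation.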

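On the memory question, a quick back-of-envelope calculation (my own numbers, assuming 32-bit IDs and float ratings, which is not exactly what either implementation does) shows why primitive arrays get close to Ted's << 1GB target while boxed hash-map entries blow it by an order of magnitude:

```java
public class MemoryEstimate {

  /** Flat triple layout: 4-byte user ID + 4-byte item ID + 4-byte float rating. */
  static long flatBytes(long numPrefs) {
    return numPrefs * (4 + 4 + 4);
  }

  /**
   * CSR-style layout: item ID + rating per observation, plus one int offset
   * per user (user IDs become implicit array positions).
   */
  static long csrBytes(long numPrefs, long numUsers) {
    return numPrefs * (4 + 4) + (numUsers + 1) * 4;
  }

  public static void main(String[] args) {
    long n = 100_000_000L;   // Netflix-scale observation count
    long users = 480_000L;   // roughly Netflix's user count
    System.out.println(flatBytes(n));       // 1.2 billion bytes, ~1.2 GB
    System.out.println(csrBytes(n, users)); // ~0.8 GB
    assert flatBytes(n) == 1_200_000_000L;
  }
}
```

Compare that with a boxed HashMap<Long, Float> per user: each entry carries object headers, a boxed key, a boxed value, and a next pointer, easily 40+ bytes per observation on a 64-bit JVM, which is where the 12G figure comes from.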