I see, OK, so we shouldn't use the old implementation. But the old
interface doesn't have to be discarded. The discrepancy between your
FactorizablePreferences and DataModel is that your model supports
getPreferences(), which returns all preferences as an iterator, while
DataModel supports a few old functions that return preferences for an
individual user or item.
My point is that it is not hard for each of them to implement what they
lack: the old DataModel can implement getPreferences() with a simple
loop in an abstract class, and your new FactorizablePreferences can
implement the old per-user/per-item functions with a binary search that
takes O(log n) time, or an interpolation search that takes O(log log n)
time on average. The same goes for the online update. It becomes a
matter of different speed and space trade-offs, not of different
interface standards: we can keep the old unit tests, old examples, old
everything. And we will be more flexible in writing an ensemble
recommender.
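The adapter idea above can be sketched roughly as follows. The class and
method names here (Preference, FlatPreferenceStore, getPreferencesFromUser)
are illustrative stand-ins, not Mahout's actual API: a flat store sorted by
user ID gives the new-style bulk iterator for free, and recovers the old
per-user accessor with an O(log n) binary search.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal stand-in for a single (user, item, rating) triple.
class Preference {
    final long userID;
    final long itemID;
    final float value;
    Preference(long userID, long itemID, float value) {
        this.userID = userID; this.itemID = itemID; this.value = value;
    }
}

// FactorizablePreferences-style store: all preferences kept in one flat
// list, sorted by user ID, so per-user lookups can use binary search.
class FlatPreferenceStore {
    private final List<Preference> prefs; // must be sorted by userID

    FlatPreferenceStore(List<Preference> sortedByUser) {
        this.prefs = sortedByUser;
    }

    // The "new" bulk accessor: trivially an iterator over everything.
    Iterable<Preference> getPreferences() {
        return prefs;
    }

    // Old DataModel-style accessor, recovered with an O(log n) binary
    // search for the user's first preference, then a short linear scan.
    List<Preference> getPreferencesFromUser(long userID) {
        int lo = 0, hi = prefs.size(); // lower bound on userID
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (prefs.get(mid).userID < userID) lo = mid + 1;
            else hi = mid;
        }
        List<Preference> result = new ArrayList<>();
        for (int i = lo; i < prefs.size() && prefs.get(i).userID == userID; i++) {
            result.add(prefs.get(i));
        }
        return result;
    }
}
```

The reverse direction (an abstract DataModel looping over its users to
emit getPreferences()) is the same trick with the roles swapped.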
Just a few thoughts; I'll have to validate the idea first before
creating a new JIRA ticket.
Yours Peng
On 13-07-16 02:51 PM, Sebastian Schelter wrote:
I completely agree, Netflix is less than one gigabyte in a smart
representation; 12x more memory is a no-go. The techniques used in
FactorizablePreferences allow a much more memory-efficient representation,
tested on the KDD Music dataset, which is approx 2.5 times the size of
Netflix and fits into 3GB with that approach.
2013/7/16 Ted Dunning <ted.dunn...@gmail.com>
Netflix is a small dataset. 12GB for that seems quite excessive.
Note also that this is before you have done any work.
Ideally, 100 million observations should take << 1GB.
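As a rough back-of-envelope for that claim, one plausible compact layout is
CSR-like: ratings sorted by user, one int item id plus one byte rating per
observation, and one int offset per user. The sizes and counts below are
illustrative assumptions, not measurements of any Mahout class.

```java
// Back-of-envelope memory estimate for ~100 million observations in a
// hypothetical CSR-like layout (not an actual Mahout data structure).
public class MemoryEstimate {
    // Bytes for `observations` entries (4-byte item id + 1-byte rating)
    // plus (users + 1) 4-byte row offsets.
    static long csrBytes(long observations, long users) {
        return observations * (4 + 1) + (users + 1) * 4;
    }

    public static void main(String[] args) {
        // Netflix-scale counts, assumed for illustration.
        long bytes = csrBytes(100_000_000L, 480_000L);
        System.out.printf("%.2f GB%n", bytes / 1e9); // ≈ 0.50 GB
    }
}
```

With 4-byte float ratings instead of bytes this roughly doubles but still
stays under 1GB, which is the gap being pointed out versus a 12GB heap.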
On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:
The second idea is indeed splendid: we should separate the
time-complexity-first and the space-complexity-first implementations.
What I'm not quite sure about is whether we really need to create two
interfaces instead of one.
Personally, I think 12GB of heap space is not that high, right? Most new
laptops can already handle that (emphasis on laptop). And if we replace
the hash map (the culprit of the high memory consumption) with a
list/linked list, it would simply degrade lookup to an O(n) linear
search, which is not too bad either. The current DataModel is the result
of careful thought and has undergone extensive testing; it is easier to
expand on top of it than to subvert it.
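The trade-off being described can be made concrete with a small sketch.
This is not Mahout code; it just contrasts a boxed HashMap (O(1) lookup,
heavy per-entry overhead from boxed keys, boxed values, and table nodes)
with parallel primitive arrays (compact, but O(n) lookup via linear scan,
or O(log n) if kept sorted).

```java
import java.util.HashMap;
import java.util.Map;

public class LookupTradeoff {
    // Compact representation: two parallel primitive arrays, no
    // per-entry objects, looked up with an O(n) linear scan.
    static float linearLookup(long[] itemIDs, float[] ratings, long itemID) {
        for (int i = 0; i < itemIDs.length; i++) {
            if (itemIDs[i] == itemID) return ratings[i];
        }
        return Float.NaN; // not rated
    }

    public static void main(String[] args) {
        long[] itemIDs = {5, 9, 42};
        float[] ratings = {3.0f, 4.5f, 2.0f};

        // Hash-map representation: O(1) lookup, but every entry carries
        // a boxed Long key, a boxed Float value, and a table node.
        Map<Long, Float> map = new HashMap<>();
        for (int i = 0; i < itemIDs.length; i++) {
            map.put(itemIDs[i], ratings[i]);
        }

        System.out.println(map.get(42L));                        // O(1) time
        System.out.println(linearLookup(itemIDs, ratings, 42L)); // O(1) space
    }
}
```

Both lookups return the same rating; the two layouts only differ in where
they spend memory versus time, which is the point of the paragraph above.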