OK, I tested the MatrixBackedDataModel, and the heap size is reduced to 7G
for the Netflix data, which is still large.

The same history is encoded in two SparseRowMatrices, one row-indexed by
users and one by items. It has serious concurrency issues in several
places, though (sets and removes need to be thread-safe).

Best,
Gokhan
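For illustration, a minimal sketch of the kind of structure described above.
The class and method names are made up for the example (this is not the actual
MatrixBackedDataModel), and the single coarse lock is only one possible way to
make sets and removes thread-safe.

import org.apache.mahout.math.SparseRowMatrix;

/**
 * Sketch only: the same preference history kept twice, once row-indexed by
 * user and once by item, with writes serialized so both copies stay in sync.
 */
public class TwoMatrixPreferencesSketch {

  private final SparseRowMatrix byUser; // rows = users, columns = items
  private final SparseRowMatrix byItem; // rows = items, columns = users

  public TwoMatrixPreferencesSketch(int numUsers, int numItems) {
    this.byUser = new SparseRowMatrix(numUsers, numItems);
    this.byItem = new SparseRowMatrix(numItems, numUsers);
  }

  /** Both matrices must be updated together, hence the coarse lock. */
  public synchronized void setPreference(int userIndex, int itemIndex, double value) {
    byUser.setQuick(userIndex, itemIndex, value);
    byItem.setQuick(itemIndex, userIndex, value);
  }

  /** Removal modeled as writing 0.0; a real model would need a "no rating" marker. */
  public synchronized void removePreference(int userIndex, int itemIndex) {
    byUser.setQuick(userIndex, itemIndex, 0.0);
    byItem.setQuick(itemIndex, userIndex, 0.0);
  }

  public double getPreference(int userIndex, int itemIndex) {
    return byUser.getQuick(userIndex, itemIndex);
  }
}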
On Sat, Jul 20, 2013 at 12:15 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:

> Hi,
>
> Just one simple question: Is the
> org.apache.mahout.math.BinarySearch.binarySearch()
> function an optimized version of Arrays.binarySearch()? If it is not, why
> implement it again?
>
> Yours Peng
>
>
> On 13-07-17 06:31 PM, Sebastian Schelter wrote:
>
>> You are completely right, the simple interface would only be usable for
>> read-only / batch-updatable recommenders. Online recommenders might need
>> something different. I tried to widen the discussion here to cover all
>> kinds of API changes in the recommenders that would be necessary in the
>> future.
>>
>>
>> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>>
>>> One thing that suddenly comes to my mind is that, for a simple
>>> interface like FactorizablePreferences, maybe sequential READ in real
>>> time is possible, but sequential WRITE in O(1) time is utopia, because
>>> you need to flush out the old preference with the same user and item ID
>>> (in the worst case it could be an interpolation search); otherwise you
>>> are permitting a user to rate an item twice with different values.
>>> Considering how FileDataModel is supposed to work (new files flush old
>>> files), maybe using the simple interface has fewer advantages than we
>>> used to believe.
>>>
>>>
>>> On 13-07-17 04:58 PM, Sebastian Schelter wrote:
>>>
>>>> Hi Peng,
>>>>
>>>> I never wanted to discard the old interface, I just wanted to split it
>>>> up. I want to have a simple interface that only supports sequential
>>>> access (and allows for very memory-efficient implementations, e.g. by
>>>> the use of primitive arrays). DataModel should *extend* this interface
>>>> and provide sequential and random access (basically what it already
>>>> does).
>>>>
>>>> Then a recommender such as SGD could state that it only needs
>>>> sequential access to the preferences, and you can either feed it a
>>>> DataModel (so we don't break backwards compatibility) or a
>>>> memory-efficient sequential-access thingy.
>>>>
>>>> Does that make sense for you?
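A rough sketch of the split described above. The interface and class names are
illustrative only, not the actual Mahout API: a narrow contract that only
promises a sequential pass over the data, a random-access contract that extends
it, and a factorizer that declares just the weaker requirement.

import org.apache.mahout.cf.taste.model.Preference;

/** Sketch: the minimal, memory-friendly contract (sequential access only). */
interface SequentialPreferences {
  Iterable<Preference> getPreferences(); // one pass over all (user, item, value) triples
  int numUsers();
  int numItems();
}

/** Sketch: the full contract adds random access, as DataModel does today. */
interface RandomAccessPreferences extends SequentialPreferences {
  Iterable<Preference> getPreferencesFromUser(long userID);
  Iterable<Preference> getPreferencesForItem(long itemID);
}

/** A recommender like SGD would only ask for the sequential contract. */
class SGDFactorizerSketch {
  void train(SequentialPreferences prefs, int numIterations) {
    for (int i = 0; i < numIterations; i++) {
      for (Preference p : prefs.getPreferences()) {
        // gradient step on (p.getUserID(), p.getItemID(), p.getValue()) goes here
      }
    }
  }
}

Since a DataModel-backed implementation could also satisfy the narrow
interface, existing code would keep working while a primitive-array
implementation remains possible for the memory-critical cases.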
>>>> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>>>>
>>>>> I see, OK, so we shouldn't use the old implementation. But I mean,
>>>>> the old interface doesn't have to be discarded. The discrepancy
>>>>> between your FactorizablePreferences and DataModel is that your model
>>>>> supports getPreferences(), which returns all preferences as an
>>>>> iterator, while DataModel supports a few old functions that return
>>>>> preferences for an individual user or item.
>>>>>
>>>>> My point is that it is not hard for each of them to implement what
>>>>> they lack: the old DataModel can implement getPreferences() just by a
>>>>> loop in the abstract class, and your new FactorizablePreferences can
>>>>> implement those old functions by a binary search that takes O(log n)
>>>>> time, or an interpolation search that takes O(log log n) time on
>>>>> average. The same goes for the online update. It would just be a
>>>>> matter of different speed and space, not of different interface
>>>>> standards; we could use the old unit tests, old examples, old
>>>>> everything. And we would be more flexible in writing ensemble
>>>>> recommenders.
>>>>>
>>>>> Just a few thoughts, I'll have to validate the idea first before
>>>>> creating a new JIRA ticket.
>>>>>
>>>>> Yours Peng
>>>>>
>>>>>
>>>>> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>>>>>
>>>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>>>> representation, 12x more memory is a no-go. The techniques used in
>>>>>> FactorizablePreferences allow a much more memory-efficient
>>>>>> representation, tested on the KDD Music dataset, which is approx 2.5
>>>>>> times Netflix and fits into 3GB with that approach.
>>>>>>
>>>>>>
>>>>>> 2013/7/16 Ted Dunning <ted.dunn...@gmail.com>
>>>>>>
>>>>>>> Netflix is a small dataset. 12G for that seems quite excessive.
>>>>>>> Note also that this is before you have done any work.
>>>>>>>
>>>>>>> Ideally, 100 million observations should take << 1GB.
>>>>>>>
>>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The second idea is indeed splendid, we should separate the
>>>>>>>> time-complexity-first and the space-complexity-first
>>>>>>>> implementations. What I'm not quite sure about is whether we
>>>>>>>> really need to create two interfaces instead of one. Personally, I
>>>>>>>> think 12G of heap space is not that high, right? Most new laptops
>>>>>>>> can already handle that (emphasis on laptop). And if we replaced
>>>>>>>> the hash map (the culprit of the high memory consumption) with a
>>>>>>>> list/linked list, it would simply degrade lookups to a linear
>>>>>>>> search in O(n), not too bad either. The current DataModel is the
>>>>>>>> result of careful thought and has undergone extensive testing; it
>>>>>>>> is easier to expand on top of it than to subvert it.
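To make Peng's bridging idea and the memory argument concrete, an illustrative
layout (again hypothetical, not the actual FactorizablePreferences): all
preferences in parallel primitive arrays sorted by user ID, so the cheap
sequential pass is the primary access path, while a per-user lookup can still
be answered by a binary search in O(log n).

import java.util.Arrays;

/** Sketch: sequential-first storage with a binary-search bridge for per-user access. */
public class PrimitiveArrayPreferencesSketch {

  private final long[] userIDs;  // sorted ascending
  private final long[] itemIDs;  // parallel to userIDs
  private final float[] values;  // parallel to userIDs

  public PrimitiveArrayPreferencesSketch(long[] userIDs, long[] itemIDs, float[] values) {
    this.userIDs = userIDs;
    this.itemIDs = itemIDs;
    this.values = values;
  }

  /** First index whose user ID is >= the given one (classic lower bound). */
  private int lowerBound(long userID) {
    int lo = 0;
    int hi = userIDs.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (userIDs[mid] < userID) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Random access bridged onto the sequential layout: O(log n) to locate the user's block. */
  public float[] getValuesFromUser(long userID) {
    int start = lowerBound(userID);
    int end = start;
    while (end < userIDs.length && userIDs[end] == userID) {
      end++;
    }
    return Arrays.copyOfRange(values, start, end);
  }
}

As a back-of-the-envelope check, long user and item IDs plus a float value cost
20 bytes per preference, roughly 2GB for the ~100 million Netflix ratings;
remapping the IDs to int indices brings that to 12 bytes, around 1.2GB, which
is why representations far below the 7G and 12G figures above are plausible.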