OK, I tested the MatrixBackedDataModel, and the heap size is reduced to 7G
for the Netflix data, which is still large.

The same rating history is encoded in 2 SparseRowMatrices: one is row-indexed
by users and the other by items.

It has serious concurrency issues in several places, though (sets and
removes need to be made thread-safe).
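
For reference, the layout is roughly the following. This is only a simplified
sketch of the idea (class and method names are illustrative, not the actual
code in the patch):

import org.apache.mahout.math.SparseRowMatrix;
import org.apache.mahout.math.Vector;

// Every preference is stored twice so that both user-wise and item-wise
// access are cheap row lookups. The coarse synchronization below is exactly
// where the thread-safety problems come from.
public class MatrixBackedDataModelSketch {

  private final SparseRowMatrix userToItem;  // row = user index, column = item index
  private final SparseRowMatrix itemToUser;  // row = item index, column = user index

  public MatrixBackedDataModelSketch(int numUsers, int numItems) {
    userToItem = new SparseRowMatrix(numUsers, numItems);
    itemToUser = new SparseRowMatrix(numItems, numUsers);
  }

  // both matrices must be updated together, atomically
  public synchronized void setPreference(int userIndex, int itemIndex, double value) {
    userToItem.setQuick(userIndex, itemIndex, value);
    itemToUser.setQuick(itemIndex, userIndex, value);
  }

  public synchronized void removePreference(int userIndex, int itemIndex) {
    userToItem.setQuick(userIndex, itemIndex, 0.0);
    itemToUser.setQuick(itemIndex, userIndex, 0.0);
  }

  public Vector preferencesFromUser(int userIndex) {
    return userToItem.viewRow(userIndex);
  }

  public Vector preferencesForItem(int itemIndex) {
    return itemToUser.viewRow(itemIndex);
  }
}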

Best

Gokhan


On Sat, Jul 20, 2013 at 12:15 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:

> Hi,
>
> Just one simple question: Is the
> org.apache.mahout.math.BinarySearch.binarySearch() function an optimized
> version of Arrays.binarySearch()? If it is not, why implement it again?
>
> Yours Peng
>
>
> On 13-07-17 06:31 PM, Sebastian Schelter wrote:
>
>> You are completely right, the simple interface would only be usable for
>> read-only / batch-updatable recommenders. Online recommenders might need
>> something different. I tried to widen the discussion here to cover all
>> kinds of API changes in the recommenders that will be necessary in the
>> future.
>>
>>
>>
>> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>>
>>> One thing that suddenly comes to my mind is that, for a simple interface
>>> like FactorizablePreferences, sequential READ in real time may be
>>> possible, but sequential WRITE in O(1) time is utopia, because you need to
>>> flush out the old preference with the same user and item ID (in the worst
>>> case this could be an interpolation search); otherwise you are permitting
>>> a user to rate an item twice with different values. Considering how
>>> FileDataModel is supposed to work (new files flush old files), maybe using
>>> the simple interface has fewer advantages than we used to believe.
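>>>
>>> Roughly what I mean, as a tiny sketch (the names here are made up, this is
>>> not actual Mahout code):
>>>
>>>   // Hypothetical: one user's preferences kept in two parallel sorted
>>>   // arrays. A write is never O(1), because it must first search for an
>>>   // existing entry with the same item ID and flush it.
>>>   static void setPreference(long[] sortedItemIDs, float[] values,
>>>                             long itemID, float newValue) {
>>>     int pos = java.util.Arrays.binarySearch(sortedItemIDs, itemID);
>>>     if (pos >= 0) {
>>>       values[pos] = newValue;   // overwrite the old rating in place
>>>     } else {
>>>       // otherwise we would have to insert at -(pos + 1) and shift both
>>>       // arrays, which is O(n) per write in the worst case
>>>     }
>>>   }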
>>>
>>>
>>> On 13-07-17 04:58 PM, Sebastian Schelter wrote:
>>>
>>>> Hi Peng,
>>>>
>>>> I never wanted to discard the old interface, I just wanted to split it up.
>>>> I want to have a simple interface that only supports sequential access
>>>> (and allows for very memory-efficient implementations, e.g. by the use of
>>>> primitive arrays). DataModel should *extend* this interface and provide
>>>> sequential and random access (basically what it already does).
>>>>
>>>> Then a recommender such as SGD could state that it only needs sequential
>>>> access to the preferences, and you can either feed it a DataModel (so we
>>>> don't break backwards compatibility) or a memory-efficient sequential
>>>> access thingy.
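>>>>
>>>> Something along these lines (just a sketch to illustrate the split; the
>>>> names and signatures here are not the exact ones from the patch):
>>>>
>>>>   import org.apache.mahout.cf.taste.common.TasteException;
>>>>   import org.apache.mahout.cf.taste.model.Preference;
>>>>   import org.apache.mahout.cf.taste.model.PreferenceArray;
>>>>
>>>>   // the minimal contract a sequential-access learner like SGD needs
>>>>   interface FactorizablePreferences {
>>>>     int numUsers();
>>>>     int numItems();
>>>>     long numPreferences();
>>>>     // one pass over all (user, item, value) triples, no random access
>>>>     Iterable<Preference> getPreferences();
>>>>   }
>>>>
>>>>   // DataModel would *extend* the simple interface and keep its
>>>>   // random-access methods on top, so existing recommenders keep working
>>>>   interface DataModel extends FactorizablePreferences {
>>>>     PreferenceArray getPreferencesFromUser(long userID) throws TasteException;
>>>>     PreferenceArray getPreferencesForItem(long itemID) throws TasteException;
>>>>     // ... plus the rest of the current random-access API
>>>>   }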
>>>>
>>>> Does that make sense for you?
>>>>
>>>>
>>>> 2013/7/17 Peng Cheng <pc...@uowmail.edu.au>
>>>>
>>>>> I see, OK, so we shouldn't use the old implementation. But I mean, the
>>>>> old interface doesn't have to be discarded. The discrepancy between your
>>>>> FactorizablePreferences and DataModel is that your model supports
>>>>> getPreferences(), which returns all preferences as an iterator, while
>>>>> DataModel supports a few old functions that return preferences for an
>>>>> individual user or item.
>>>>>
>>>>> My point is that it is not hard for each of them to implement what the
>>>>> other lacks: the old DataModel can implement getPreferences() just by a
>>>>> loop in an abstract class (sketched below), and your new
>>>>> FactorizablePreferences can implement those old functions by a binary
>>>>> search that takes O(log n) time, or an interpolation search that takes
>>>>> O(log log n) time on average. The same goes for the online update. It
>>>>> would just be a matter of different speed and space, not of a different
>>>>> interface standard, so we can use the old unit tests, old examples, old
>>>>> everything. And we will be more flexible in writing ensemble
>>>>> recommenders.
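>>>>>
>>>>> As a sketch of the first direction (assuming getPreferences() returns an
>>>>> Iterable<Preference>; this is only an illustration, not actual code):
>>>>>
>>>>>   import java.util.ArrayList;
>>>>>   import java.util.List;
>>>>>   import org.apache.mahout.cf.taste.common.TasteException;
>>>>>   import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
>>>>>   import org.apache.mahout.cf.taste.model.DataModel;
>>>>>   import org.apache.mahout.cf.taste.model.Preference;
>>>>>
>>>>>   // gives any existing DataModel the sequential view by a plain loop
>>>>>   // (a real implementation should stream lazily instead of copying)
>>>>>   abstract class SequentialViewAdapter implements DataModel {
>>>>>     public Iterable<Preference> getPreferences() throws TasteException {
>>>>>       List<Preference> all = new ArrayList<Preference>();
>>>>>       LongPrimitiveIterator userIDs = getUserIDs();
>>>>>       while (userIDs.hasNext()) {
>>>>>         for (Preference pref : getPreferencesFromUser(userIDs.nextLong())) {
>>>>>           all.add(pref);
>>>>>         }
>>>>>       }
>>>>>       return all;
>>>>>     }
>>>>>   }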
>>>>>
>>>>> Just a few thoughts, I'll have to validate the idea first before
>>>>> creating a new JIRA ticket.
>>>>>
>>>>> Yours Peng
>>>>>
>>>>>
>>>>>
>>>>> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>>>>>
>>>>>> I completely agree, Netflix is less than one gigabyte in a smart
>>>>>> representation; 12x more memory is a no-go. The techniques used in
>>>>>> FactorizablePreferences allow a much more memory-efficient
>>>>>> representation, tested on the KDD Music dataset, which is approximately
>>>>>> 2.5 times the size of Netflix and fits into 3GB with that approach.
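>>>>>>
>>>>>> The representation is essentially just parallel primitive arrays,
>>>>>> something like this simplified sketch (only to show where the savings
>>>>>> come from, the actual class looks different):
>>>>>>
>>>>>>   // one entry per preference, no per-entry objects, no boxing;
>>>>>>   // e.g. ~100M preferences * (4 + 4 + 4) bytes is roughly 1.2 GB
>>>>>>   class SequentialPreferencesSketch {
>>>>>>     final int[] userIndexes;
>>>>>>     final int[] itemIndexes;
>>>>>>     final float[] values;
>>>>>>
>>>>>>     SequentialPreferencesSketch(int numPreferences) {
>>>>>>       userIndexes = new int[numPreferences];
>>>>>>       itemIndexes = new int[numPreferences];
>>>>>>       values = new float[numPreferences];
>>>>>>     }
>>>>>>   }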
>>>>>>
>>>>>>
>>>>>> 2013/7/16 Ted Dunning <ted.dunn...@gmail.com>
>>>>>>
>>>>>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>>>>>>
>>>>>>> Note also that this is before you have done any work.
>>>>>>>
>>>>>>> Ideally, 100 million observations should take << 1GB.
>>>>>>>
>>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The second idea is indeed splendid, we should separate
>>>>>>>> time-complexity-first and space-complexity-first implementations. What
>>>>>>>> I'm not quite sure about is whether we really need to create two
>>>>>>>> interfaces instead of one. Personally, I think a 12G heap is not that
>>>>>>>> high, right? Most new laptops can already handle that (emphasis on
>>>>>>>> laptop). And if we replace the hash map (the culprit of the high
>>>>>>>> memory consumption) with a list/linked list, it would simply degrade
>>>>>>>> the time complexity of a lookup to an O(n) linear search, which is not
>>>>>>>> too bad either. The current DataModel is the result of careful thought
>>>>>>>> and has undergone extensive testing, so it is easier to expand on top
>>>>>>>> of it than to subvert it.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>
>
>
