[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

Peng Cheng (JIRA) Mon, 22 Jul 2013 17:05:15 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715885#comment-13715885
 ]


Peng Cheng commented on MAHOUT-1286:
------------------------------------

On second thought, the hash map is very likely not the culprit for the poor 
memory efficiency here; apologies for the misinformation. FastByIDMap uses the 
double hashing algorithm described in Donald Knuth's 'The Art of Computer 
Programming', with a default loadFactor of 1.5, which means the backing array 
is only 1.5 times the number of keys. So in theory the heap size of 
GenericDataModel should never exceed 3 times the size of 
FactorizablePreferences. I'm still unclear about parts of FastByIDMap's 
implementation, such as how it handles deletion of entries, so I cannot tell 
whether what I observed on the Netflix dataset is caused by GC (e.g. 
constructing new arrays too often), by deletion, or by the extra space 
allocated for timestamps. We will probably have to run the Netflix dataset in 
debug mode to identify the problem.
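
To make the sizing arithmetic concrete, here is a minimal sketch of an 
open-addressing table with double hashing, sized at 1.5 slots per expected key. 
This is not FastByIDMap's code: the class name DoubleHashSketch, the 
prime-rounded capacity, the sentinel key, and the lack of deletion and resizing 
are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Illustrative long-to-int map using open addressing with double hashing. */
final class DoubleHashSketch {
    // Sentinel marking an empty slot; this key value cannot be stored.
    private static final long EMPTY = Long.MIN_VALUE;

    private final long[] keys;
    private final int[] values;
    private int size;

    /** Allocates about loadFactor * expectedKeys slots, rounded up to a prime. */
    DoubleHashSketch(int expectedKeys, double loadFactor) {
        int capacity = nextPrime((int) Math.ceil(expectedKeys * loadFactor));
        keys = new long[capacity];
        values = new int[capacity];
        Arrays.fill(keys, EMPTY);
    }

    int capacity() { return keys.length; }

    int size() { return size; }

    void put(long key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        int i = find(key);
        if (keys[i] == EMPTY) {
            // Keep one slot free so that probing always terminates.
            if (size == keys.length - 1) {
                throw new IllegalStateException("table full (no resizing in this sketch)");
            }
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    /** Returns the stored value, or defaultValue if the key is absent. */
    int get(long key, int defaultValue) {
        if (key == EMPTY) {
            return defaultValue;
        }
        int i = find(key);
        return keys[i] == key ? values[i] : defaultValue;
    }

    /** Probes until it finds the key or an empty slot. */
    private int find(long key) {
        int n = keys.length;
        int h = (int) (key ^ (key >>> 32)) & 0x7FFFFFFF;
        int i = h % n;
        // Second hash: a step in [1, n - 2]; since n is prime, the probe
        // sequence visits every slot before repeating.
        int step = 1 + h % (n - 2);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + step) % n;
        }
        return i;
    }

    private static int nextPrime(int n) {
        int c = Math.max(3, n | 1); // odd candidate, at least 3
        while (!isPrime(c)) {
            c += 2;
        }
        return c;
    }

    // Only called with odd c >= 3, so even divisors need not be checked.
    private static boolean isPrime(int c) {
        for (int d = 3; (long) d * d <= c; d += 2) {
            if (c % d == 0) {
                return false;
            }
        }
        return true;
    }
}
```

With 100 expected keys and a factor of 1.5, this sketch allocates 151 slots 
(150 rounded up to the next prime), at 12 bytes of array payload per slot for 
a long key plus an int value.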

I'll try to bring this topic up in the next hangout. If you know those FastMap 
implementations well, please give me some hints.
                
> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1286
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>            Reporter: Peng Cheng
>            Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementations in the current CF component use hash maps to 
> enable fast 2D indexing and updates. This is not memory-efficient for big 
> data sets; e.g. the Netflix Prize dataset takes 11G of heap space as a 
> FileDataModel.
> An improved DataModel implementation should use more compact data structures 
> (like arrays). This trades a little time complexity in 2D indexing for a 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in its 
> training process.
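
As a rough illustration of the array-based layout the description has in mind, 
the sketch below stores preferences in parallel arrays, grouped by user and 
sorted by item ID, so a 2D lookup costs two binary searches instead of two hash 
probes. It is not the implementation proposed in this issue: the class name 
CompactPrefsSketch, the row-offset layout, and the NaN convention for missing 
preferences are assumptions.

```java
import java.util.Arrays;

/** Hypothetical compact preference store built on parallel primitive arrays. */
final class CompactPrefsSketch {
    private final long[] userIDs;  // sorted ascending, one entry per user
    private final int[] rowStart;  // user u owns slots rowStart[u] .. rowStart[u + 1] - 1
    private final long[] itemIDs;  // sorted ascending within each user's range
    private final float[] values;  // values[k] is the preference for itemIDs[k]

    CompactPrefsSketch(long[] userIDs, int[] rowStart, long[] itemIDs, float[] values) {
        if (rowStart.length != userIDs.length + 1 || itemIDs.length != values.length) {
            throw new IllegalArgumentException("inconsistent array lengths");
        }
        this.userIDs = userIDs;
        this.rowStart = rowStart;
        this.itemIDs = itemIDs;
        this.values = values;
    }

    /** Returns the slot holding (userID, itemID), or -1 if the pair is absent. */
    private int slot(long userID, long itemID) {
        int u = Arrays.binarySearch(userIDs, userID);
        if (u < 0) {
            return -1;
        }
        int k = Arrays.binarySearch(itemIDs, rowStart[u], rowStart[u + 1], itemID);
        return k < 0 ? -1 : k;
    }

    /** Returns the preference value, or NaN if the pair is absent. */
    float get(long userID, long itemID) {
        int k = slot(userID, itemID);
        return k < 0 ? Float.NaN : values[k];
    }

    /** Overwrites an existing preference in place; returns false if absent. */
    boolean set(long userID, long itemID, float value) {
        int k = slot(userID, itemID);
        if (k < 0) {
            return false;
        }
        values[k] = value;
        return true;
    }
}
```

Each preference costs 12 bytes of array payload here (a long item ID plus a 
float value), and updating an existing preference touches a single array slot. 
Inserting new users or items would require shifting or rebuilding the arrays, 
which this sketch leaves out.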

