[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717659#comment-13717659 ]
Peng Cheng commented on MAHOUT-1286:
------------------------------------

Aye aye, I just did. It turns out that instances of PreferenceArray$PreferenceView take up 1.7G. Quite unexpected, right? Thanks a lot for the advice. My next experiment will use GenericPreference[] directly; there will be no more PreferenceArray.

Class Name                                                                      |    Objects |  Shallow Heap |    Retained Heap
-------------------------------------------------------------------------------------------------------------------------------
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView | 72,237,632 | 1,733,703,168 | >= 1,733,703,168
long[]                                                                          |    480,199 |   818,209,680 | >=   818,209,680
float[]                                                                         |    480,190 |   410,563,592 | >=   410,563,592
java.lang.Object[]                                                              |     18,230 |   361,525,488 | >= 2,443,647,088
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray                |    480,189 |    15,366,048 | >= 1,237,456,672
java.util.ArrayList                                                             |     17,811 |       427,464 | >= 2,092,416,104
char[]                                                                          |      2,150 |       272,632 | >=       272,632
byte[]                                                                          |        141 |        54,048 | >=        54,048
java.lang.String                                                                |      2,119 |        50,856 | >=       271,920
java.util.concurrent.ConcurrentHashMap$HashEntry                                |        673 |        21,536 | >=        38,104
java.net.URL                                                                    |        229 |        14,656 | >=        40,720
java.util.HashMap$Entry                                                         |        344 |        11,008 | >=        68,760
-------------------------------------------------------------------------------------------------------------------------------

> Memory-efficient DataModel, supporting fast online updates and element-wise iteration
> -------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1286
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>            Reporter: Peng Cheng
>            Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementations in the current CF component use hash maps to enable fast 2D indexing and updates. This is not memory-efficient for big data sets; e.g. the Netflix prize dataset takes 11G of heap space as a FileDataModel.
> An improved implementation of DataModel should use more compact data structures (like arrays); this can trade a little time complexity in 2D indexing for a vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in the training process.
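
For illustration only, here is a minimal sketch of the kind of array-based layout the issue description proposes. The class name CompactUserPrefs and all of its methods are hypothetical and are not part of Mahout's API. It keeps one user's preferences in two parallel primitive arrays sorted by item ID, so a lookup is a binary search (the "little time complexity" traded away) and iteration goes by index without allocating a view object per element. For scale, the histogram above works out to 1,733,703,168 / 72,237,632 = 24 bytes of shallow heap per PreferenceView instance.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch, not a Mahout class: one user's preferences held in
 * two parallel primitive arrays kept sorted by item ID.
 */
final class CompactUserPrefs {
    private long[] itemIDs;  // sorted ascending; only the first `size` slots are used
    private float[] values;  // values[i] is the preference for itemIDs[i]
    private int size;

    CompactUserPrefs(int initialCapacity) {
        itemIDs = new long[Math.max(1, initialCapacity)];
        values = new float[itemIDs.length];
    }

    /** Online update: overwrite if present, otherwise insert in sorted position. */
    void set(long itemID, float value) {
        int pos = Arrays.binarySearch(itemIDs, 0, size, itemID);
        if (pos >= 0) {
            values[pos] = value;
            return;
        }
        int ins = -(pos + 1);  // insertion point reported by binarySearch
        if (size == itemIDs.length) {
            // Grow both arrays together so they stay parallel.
            itemIDs = Arrays.copyOf(itemIDs, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        // Shift the tail right by one slot to make room.
        System.arraycopy(itemIDs, ins, itemIDs, ins + 1, size - ins);
        System.arraycopy(values, ins, values, ins + 1, size - ins);
        itemIDs[ins] = itemID;
        values[ins] = value;
        size++;
    }

    /** O(log n) lookup; returns NaN when the user has no preference for the item. */
    float get(long itemID) {
        int pos = Arrays.binarySearch(itemIDs, 0, size, itemID);
        return pos >= 0 ? values[pos] : Float.NaN;
    }

    int size() { return size; }

    // Index-based accessors allow element-wise iteration without per-element objects.
    long itemIDAt(int i) { return itemIDs[i]; }
    float valueAt(int i) { return values[i]; }

    public static void main(String[] args) {
        CompactUserPrefs prefs = new CompactUserPrefs(2);
        prefs.set(30L, 4.0f);
        prefs.set(10L, 2.5f);
        prefs.set(20L, 5.0f);
        prefs.set(10L, 3.0f);  // overwrites the earlier value for item 10
        for (int i = 0; i < prefs.size(); i++) {
            System.out.println(prefs.itemIDAt(i) + " -> " + prefs.valueAt(i));
        }
    }
}
```

Inserting into a sorted array costs a linear shift, so this layout suits users with modest preference counts or mostly-append workloads. Whether that trade-off is acceptable for the DataModel proposed here is an open design question, not something the issue settles.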