IMHO you will always have memory issues if you try to provide constant-time random access. That's why I proposed creating a special memory-efficient DataModel for sequential access.
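
To make the idea concrete, here is a minimal sketch of what such a sequential-access store could look like. It is not existing Mahout code: the class name SequentialPreferences, its Visitor interface, and the choice to store a user ID per preference are all assumptions made for illustration. It keeps every preference in three parallel primitive arrays and only offers append and in-order iteration, so no per-preference object is ever allocated.

```java
import java.util.Arrays;

/** Hypothetical sketch, not a Mahout class: preferences kept in parallel primitive arrays. */
final class SequentialPreferences {

    /** Callback invoked once per stored preference. */
    interface Visitor {
        void visit(long userID, long itemID, float value);
    }

    private long[] userIDs = new long[16];
    private long[] itemIDs = new long[16];
    private float[] values = new float[16];
    private int size;

    /** Appends one preference; amortized O(1), growing the arrays by doubling. */
    void add(long userID, long itemID, float value) {
        if (size == userIDs.length) {
            int newCapacity = size * 2;
            userIDs = Arrays.copyOf(userIDs, newCapacity);
            itemIDs = Arrays.copyOf(itemIDs, newCapacity);
            values = Arrays.copyOf(values, newCapacity);
        }
        userIDs[size] = userID;
        itemIDs[size] = itemID;
        values[size] = value;
        size++;
    }

    /** Number of preferences stored so far. */
    int size() {
        return size;
    }

    /** Element-wise iteration in insertion order; there is no lookup by user or item. */
    void forEach(Visitor visitor) {
        for (int i = 0; i < size; i++) {
            visitor.visit(userIDs[i], itemIDs[i], values[i]);
        }
    }
}
```

A trainer that only needs one pass per iteration would call forEach and never ask for a single (user, item) cell. Finding one cell in this layout costs a linear scan, which is exactly the trade-off: random access is given up in exchange for a flat memory footprint.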

2013/7/23 Peng Cheng (JIRA) <j...@apache.org>

> [ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717659#comment-13717659 ]
>
> Peng Cheng commented on MAHOUT-1286:
> ------------------------------------
>
> Aye aye, I just did, turns out that instances of PreferenceArray$PreferenceView has taken 1.7G. Quite unexpected right?
> Thanks a lot for the advice.
> My next experiment will just use GenericPreference [] directly, there will be no more PreferenceArray.
>
> Class Name                                                                      |    Objects |  Shallow Heap | Retained Heap
> -------------------------------------------------------------------------------------------------------------------------------
> org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView | 72,237,632 | 1,733,703,168 | >= 1,733,703,168
> long[]                                                                          |    480,199 |   818,209,680 | >=   818,209,680
> float[]                                                                         |    480,190 |   410,563,592 | >=   410,563,592
> java.lang.Object[]                                                              |     18,230 |   361,525,488 | >= 2,443,647,088
> org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray                |    480,189 |    15,366,048 | >= 1,237,456,672
> java.util.ArrayList                                                             |     17,811 |       427,464 | >= 2,092,416,104
> char[]                                                                          |      2,150 |       272,632 | >=       272,632
> byte[]                                                                          |        141 |        54,048 | >=        54,048
> java.lang.String                                                                |      2,119 |        50,856 | >=       271,920
> java.util.concurrent.ConcurrentHashMap$HashEntry                                |        673 |        21,536 | >=        38,104
> java.net.URL                                                                    |        229 |        14,656 | >=        40,720
> java.util.HashMap$Entry                                                         |        344 |        11,008 | >=        68,760
> -------------------------------------------------------------------------------------------------------------------------------
>
> > Memory-efficient DataModel, supporting fast online updates and element-wise iteration
> > -------------------------------------------------------------------------------------
> >
> >                 Key: MAHOUT-1286
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Collaborative Filtering
> >    Affects Versions: 0.9
> >            Reporter: Peng Cheng
> >            Assignee: Sean Owen
> >   Original Estimate: 336h
> >  Remaining Estimate: 336h
> >
> > Most DataModel implementation in current CF component use hash map to enable fast 2d indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> > Improved implementation of DataModel should use more compact data structure (like arrays), this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in training process.
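
For what it's worth, the histogram quoted above can be checked with a little arithmetic. The sketch below is only that: the class HeapDumpArithmetic and its method names are made up for illustration, and it assumes each PreferenceView instance corresponds to one stored preference, which the dump itself does not confirm. It divides the shallow heap of the PreferenceView row by its object count, and compares that with the raw cost of one long item ID plus one float value per preference, ignoring array headers.

```java
/** Hypothetical helper for checking the heap histogram figures; not part of Mahout. */
final class HeapDumpArithmetic {

    // Figures copied from the PreferenceView row of the quoted histogram.
    static final long VIEW_OBJECTS = 72_237_632L;
    static final long VIEW_SHALLOW_BYTES = 1_733_703_168L;

    /** Shallow bytes occupied by a single PreferenceView instance. */
    static long bytesPerView() {
        return VIEW_SHALLOW_BYTES / VIEW_OBJECTS;
    }

    /** Bytes for one long ID plus one float value per preference, array headers ignored. */
    static long parallelArrayBytes(long preferences) {
        return preferences * (Long.BYTES + Float.BYTES);
    }
}
```

With these numbers, bytesPerView() comes out at 24 bytes per view object, while parallelArrayBytes(VIEW_OBJECTS) is 866,851,584 bytes. So under the one-view-per-preference assumption, the view objects alone take twice the space of the primitive data they wrap.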