[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717659#comment-13717659 ]
Peng Cheng commented on MAHOUT-1286:
------------------------------------
Aye aye, I just did. It turns out that instances of
GenericUserPreferenceArray$PreferenceView have taken 1.7G of heap. Quite
unexpected, right? Thanks a lot for the advice.
My next experiment will just use GenericPreference[] directly; there will be
no more PreferenceArray.
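To make the finding concrete, here is a minimal sketch of how per-element view objects can pile up on the heap. It assumes the views were allocated one per element and then kept reachable; that is a guess about the experiment, not something the heap dump proves. ToyPreferenceArray, View, collectViews and sumValues are hypothetical names for illustration and are not Mahout classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an array-backed preference container; not Mahout's class. */
final class ToyPreferenceArray {
    private final long[] itemIds;
    private final float[] values;

    ToyPreferenceArray(long[] itemIds, float[] values) {
        if (itemIds.length != values.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        this.itemIds = itemIds;
        this.values = values;
    }

    int length() { return itemIds.length; }
    long itemId(int i) { return itemIds[i]; }
    float value(int i) { return values[i]; }

    /** Allocates one small wrapper object per call. */
    View view(int i) { return new View(i); }

    /** Inner class: each instance carries an index plus a reference to the outer array. */
    final class View {
        private final int index;
        View(int index) { this.index = index; }
        long itemId() { return itemIds[index]; }
        float value() { return values[index]; }
    }

    /** Retains one View per preference, so every wrapper stays on the heap. */
    static List<View> collectViews(ToyPreferenceArray prefs) {
        List<View> out = new ArrayList<>(prefs.length());
        for (int i = 0; i < prefs.length(); i++) {
            out.add(prefs.view(i));
        }
        return out;
    }

    /** Reads the same data through index accessors and allocates no wrappers. */
    static double sumValues(ToyPreferenceArray prefs) {
        double sum = 0;
        for (int i = 0; i < prefs.length(); i++) {
            sum += prefs.value(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        ToyPreferenceArray prefs = new ToyPreferenceArray(
                new long[] {10L, 20L, 30L}, new float[] {1.0f, 2.5f, 4.0f});
        System.out.println("views retained: " + collectViews(prefs).size());
        System.out.println("sum of values:  " + sumValues(prefs));
    }
}
```

If the views are only needed transiently, reading through the index accessors avoids keeping any wrapper objects alive.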
Class Name                                                                      |    Objects |  Shallow Heap |    Retained Heap
-------------------------------------------------------------------------------------------------------------------------------
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView | 72,237,632 | 1,733,703,168 | >= 1,733,703,168
long[]                                                                          |    480,199 |   818,209,680 | >=   818,209,680
float[]                                                                         |    480,190 |   410,563,592 | >=   410,563,592
java.lang.Object[]                                                              |     18,230 |   361,525,488 | >= 2,443,647,088
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray                |    480,189 |    15,366,048 | >= 1,237,456,672
java.util.ArrayList                                                             |     17,811 |       427,464 | >= 2,092,416,104
char[]                                                                          |      2,150 |       272,632 | >=       272,632
byte[]                                                                          |        141 |        54,048 | >=        54,048
java.lang.String                                                                |      2,119 |        50,856 | >=       271,920
java.util.concurrent.ConcurrentHashMap$HashEntry                                |        673 |        21,536 | >=        38,104
java.net.URL                                                                    |        229 |        14,656 | >=        40,720
java.util.HashMap$Entry                                                         |        344 |        11,008 | >=        68,760
-------------------------------------------------------------------------------------------------------------------------------
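As a quick sanity check on the histogram above, the sketch below divides the PreferenceView shallow heap by its object count and compares it with the combined long[] and float[] rows. The figures are copied from the table; the class and method names are made up for illustration. Note that the long[] and float[] rows cover every such array in the dump, not only the preference arrays, so the comparison is approximate.

```java
/** Arithmetic on histogram rows; figures are copied from the table, nothing is measured here. */
final class HeapHistogramCheck {
    static final long VIEW_OBJECTS = 72_237_632L;
    static final long VIEW_SHALLOW_BYTES = 1_733_703_168L;
    static final long LONG_ARRAY_BYTES = 818_209_680L;
    static final long FLOAT_ARRAY_BYTES = 410_563_592L;

    /** Shallow size of a single PreferenceView instance, in bytes. */
    static long bytesPerView() {
        return VIEW_SHALLOW_BYTES / VIEW_OBJECTS;
    }

    /** Combined shallow size of all long[] and float[] instances in the dump. */
    static long primitiveArrayBytes() {
        return LONG_ARRAY_BYTES + FLOAT_ARRAY_BYTES;
    }

    /** True when the view wrappers occupy more heap than the primitive arrays. */
    static boolean viewsOutweighArrays() {
        return VIEW_SHALLOW_BYTES > primitiveArrayBytes();
    }

    public static void main(String[] args) {
        System.out.println("bytes per PreferenceView: " + bytesPerView());         // 24
        System.out.println("long[] + float[] bytes:   " + primitiveArrayBytes());  // 1228773272
        System.out.println("views outweigh arrays:    " + viewsOutweighArrays());  // true
    }
}
```

So each view costs 24 bytes of shallow heap, and together the views take up more space than the primitive arrays in the dump.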
> Memory-efficient DataModel, supporting fast online updates and element-wise
> iteration
> -------------------------------------------------------------------------------------
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.9
> Reporter: Peng Cheng
> Assignee: Sean Owen
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Most DataModel implementations in the current CF component use hash maps to
> enable fast 2D indexing and updates. This is not memory-efficient for big data
> sets; e.g., the Netflix Prize dataset takes 11G of heap space as a FileDataModel.
> An improved implementation of DataModel should use more compact data structures
> (like arrays), which can trade a little time complexity in 2D indexing for a
> vast improvement in memory efficiency. In addition, any online recommender or
> online-to-batch converted recommender will not be affected by this in the
> training process.
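The trade-off described in the issue can be sketched roughly as follows: sorted primitive arrays replace the hash maps, and each (user, item) lookup becomes two binary searches instead of two hash probes. This is only an illustration under assumed names (CompactDataModelSketch, getPreference, setPreference); it is not the implementation proposed for MAHOUT-1286, and it does not implement Mahout's DataModel interface.

```java
import java.util.Arrays;

/** Hypothetical array-backed store; user IDs and each user's item IDs must be sorted ascending. */
final class CompactDataModelSketch {
    private final long[] userIds;    // sorted ascending
    private final long[][] itemIds;  // itemIds[u] is sorted ascending and belongs to userIds[u]
    private final float[][] values;  // values[u][k] is the preference for itemIds[u][k]

    CompactDataModelSketch(long[] userIds, long[][] itemIds, float[][] values) {
        this.userIds = userIds;
        this.itemIds = itemIds;
        this.values = values;
    }

    /** O(log users + log items) lookup; returns NaN when the pair is absent. */
    float getPreference(long userId, long itemId) {
        int u = Arrays.binarySearch(userIds, userId);
        if (u < 0) {
            return Float.NaN;
        }
        int k = Arrays.binarySearch(itemIds[u], itemId);
        return k < 0 ? Float.NaN : values[u][k];
    }

    /** Overwrites an existing preference in place; returns false when the pair is absent. */
    boolean setPreference(long userId, long itemId, float value) {
        int u = Arrays.binarySearch(userIds, userId);
        if (u < 0) {
            return false;
        }
        int k = Arrays.binarySearch(itemIds[u], itemId);
        if (k < 0) {
            return false;
        }
        values[u][k] = value;
        return true;
    }

    public static void main(String[] args) {
        CompactDataModelSketch model = new CompactDataModelSketch(
                new long[] {1L, 7L},
                new long[][] {{100L, 200L}, {200L, 300L, 400L}},
                new float[][] {{4.0f, 2.0f}, {5.0f, 1.0f, 3.5f}});
        System.out.println(model.getPreference(7L, 300L));
        model.setPreference(1L, 200L, 3.0f);
        System.out.println(model.getPreference(1L, 200L));
    }
}
```

The sketch only updates preferences that already exist. Adding a new user or item would mean growing and re-sorting the arrays, which is where most of the extra time cost of this layout would show up.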
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira