[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717659#comment-13717659 ]
Peng Cheng commented on MAHOUT-1286:
------------------------------------
Aye aye, I just did. It turns out that instances of
GenericUserPreferenceArray$PreferenceView take up 1.7G. Quite unexpected, right?
Thanks a lot for the advice.
My next experiment will use GenericPreference[] directly, with no
PreferenceArray at all.
Class Name                                                                      |    Objects |  Shallow Heap |    Retained Heap
-------------------------------------------------------------------------------------------------------------------------------
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView | 72,237,632 | 1,733,703,168 | >= 1,733,703,168
long[]                                                                          |    480,199 |   818,209,680 |   >= 818,209,680
float[]                                                                         |    480,190 |   410,563,592 |   >= 410,563,592
java.lang.Object[]                                                              |     18,230 |   361,525,488 | >= 2,443,647,088
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray                |    480,189 |    15,366,048 | >= 1,237,456,672
java.util.ArrayList                                                             |     17,811 |       427,464 | >= 2,092,416,104
char[]                                                                          |      2,150 |       272,632 |       >= 272,632
byte[]                                                                          |        141 |        54,048 |        >= 54,048
java.lang.String                                                                |      2,119 |        50,856 |       >= 271,920
java.util.concurrent.ConcurrentHashMap$HashEntry                                |        673 |        21,536 |        >= 38,104
java.net.URL                                                                    |        229 |        14,656 |        >= 40,720
java.util.HashMap$Entry                                                         |        344 |        11,008 |        >= 68,760
-------------------------------------------------------------------------------------------------------------------------------
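As a quick sanity check on the figures above, dividing shallow heap by object count gives the per-object cost of a PreferenceView. This is only a sketch: the class and method names are made up for illustration, and the two input numbers are copied from the first row of the table.

```java
public class HeapPerObject {

    // Average shallow size of one instance, in bytes (integer division).
    static long bytesPerObject(long shallowHeapBytes, long objectCount) {
        return shallowHeapBytes / objectCount;
    }

    public static void main(String[] args) {
        // Figures taken from the PreferenceView row of the heap dump.
        long viewObjects = 72_237_632L;
        long viewShallowHeap = 1_733_703_168L;
        System.out.println(bytesPerObject(viewShallowHeap, viewObjects)); // prints 24
    }
}
```

So each view object costs 24 bytes of shallow heap, and there are about 72 million of them, which is where the 1.7G goes.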
Memory-efficient DataModel, supporting fast online updates and
element-wise iteration
-------------------------------------------------------------------------------------
Key: MAHOUT-1286
URL: https://issues.apache.org/jira/browse/MAHOUT-1286
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
Original Estimate: 336h
Remaining Estimate: 336h
Most DataModel implementations in the current CF component use hash maps to
enable fast 2D indexing and updates. This is not memory-efficient for big
data sets; e.g., the Netflix Prize dataset takes 11G of heap space as a
FileDataModel.
An improved DataModel implementation should use more compact data
structures (such as arrays). This trades a little time complexity in 2D
indexing for a vast improvement in memory efficiency. In addition, any online
recommender or online-to-batch converted recommender will not be affected
by this during training.
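To make the trade-off concrete, here is a minimal sketch of one possible array-based layout, assuming preferences are stored row by row with item IDs sorted within each user. It is not the proposed implementation and not an existing Mahout class; the name CompactPreferences and its methods are hypothetical. Lookup becomes two binary searches instead of two hash probes, and storage is 12 bytes per preference plus a small per-user index.

```java
import java.util.Arrays;

/** Hypothetical illustration of a compact layout; not a Mahout class. */
public class CompactPreferences {
    private final long[] userIDs;  // sorted ascending, one entry per user
    private final int[] rowStart;  // user u owns entries rowStart[u] .. rowStart[u + 1] - 1
    private final long[] itemIDs;  // sorted ascending within each user's range
    private final float[] values;  // parallel to itemIDs

    public CompactPreferences(long[] userIDs, int[] rowStart, long[] itemIDs, float[] values) {
        if (rowStart.length != userIDs.length + 1 || itemIDs.length != values.length) {
            throw new IllegalArgumentException("inconsistent array lengths");
        }
        this.userIDs = userIDs;
        this.rowStart = rowStart;
        this.itemIDs = itemIDs;
        this.values = values;
    }

    // Index of the (userID, itemID) entry, or a negative number if it is not stored.
    private int indexOf(long userID, long itemID) {
        int u = Arrays.binarySearch(userIDs, userID);
        if (u < 0) {
            return -1;
        }
        return Arrays.binarySearch(itemIDs, rowStart[u], rowStart[u + 1], itemID);
    }

    /** Returns the stored preference value, or NaN if the pair is absent. */
    public float get(long userID, long itemID) {
        int i = indexOf(userID, itemID);
        return i < 0 ? Float.NaN : values[i];
    }

    /** Overwrites an existing value in place; returns false if the pair is absent. */
    public boolean set(long userID, long itemID, float value) {
        int i = indexOf(userID, itemID);
        if (i < 0) {
            return false;
        }
        values[i] = value;
        return true;
    }

    public static void main(String[] args) {
        // Two users: user 1 rated items 10 and 20, user 5 rated item 10.
        CompactPreferences prefs = new CompactPreferences(
                new long[] {1L, 5L},
                new int[] {0, 2, 3},
                new long[] {10L, 20L, 10L},
                new float[] {4.0f, 3.0f, 5.0f});
        System.out.println(prefs.get(1L, 20L)); // prints 3.0
        System.out.println(prefs.get(5L, 20L)); // prints NaN
    }
}
```

Note that this sketch only updates values that already exist; inserting a new preference would need shifting or a separate overflow buffer, which is left out here.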
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA
administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira