That's exactly what I'm trying to do right now :) (I'm testing FastByIDArrayMap). But we probably have more problems than just HashMap: based on the heap dump analysis, PreferenceArray will probably be our next target. This is awesome, as your FactorizablePreferences didn't use it in the first place.

Yours Peng

On 13-07-23 05:46 PM, Sebastian Schelter wrote:
IMHO you will always have memory issues if you try to provide constant-time
random access. That's why I proposed creating a special memory-efficient
DataModel for sequential access.
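
One way to picture a sequential-access model like the one proposed above is the minimal sketch below. It rests on assumptions: SequentialPreferences and PreferenceVisitor are hypothetical names, not existing Mahout classes, and the data is held as three parallel primitive arrays. The point it illustrates is that element-wise iteration can hand out primitives, so no per-preference object is ever allocated.

```java
/** Hypothetical sequential-access preference store; a sketch, not part of Mahout's API. */
public class SequentialPreferences {

    /** Callback that receives primitives, so iteration allocates no per-preference objects. */
    @FunctionalInterface
    public interface PreferenceVisitor {
        void visit(long userID, long itemID, float value);
    }

    // Three parallel arrays: entry i is the triple (userIDs[i], itemIDs[i], values[i]).
    private final long[] userIDs;
    private final long[] itemIDs;
    private final float[] values;

    public SequentialPreferences(long[] userIDs, long[] itemIDs, float[] values) {
        if (userIDs.length != itemIDs.length || itemIDs.length != values.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.userIDs = userIDs;
        this.itemIDs = itemIDs;
        this.values = values;
    }

    /** Visits every preference once, in storage order. There is no random access by design. */
    public void forEach(PreferenceVisitor visitor) {
        for (int i = 0; i < values.length; i++) {
            visitor.visit(userIDs[i], itemIDs[i], values[i]);
        }
    }

    public int size() {
        return values.length;
    }

    public static void main(String[] args) {
        SequentialPreferences prefs = new SequentialPreferences(
            new long[] {1L, 1L, 5L},
            new long[] {10L, 20L, 10L},
            new float[] {3.0f, 4.5f, 1.0f});

        // Example pass over the data: accumulate the sum of all ratings.
        double[] sum = new double[1];
        prefs.forEach((userID, itemID, value) -> sum[0] += value);
        System.out.println("mean rating = " + sum[0] / prefs.size());
    }
}
```

Each stored preference costs 8 + 8 + 4 = 20 bytes in this layout, with no object headers or references on top.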


2013/7/23 Peng Cheng (JIRA) <j...@apache.org>

[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717659#comment-13717659 ]

Peng Cheng commented on MAHOUT-1286:
------------------------------------

Aye aye, I just did. It turns out that instances of
PreferenceArray$PreferenceView have taken 1.7G. Quite unexpected, right?
Thanks a lot for the advice.
My next experiment will use GenericPreference[] directly; there will
be no more PreferenceArray.

Class Name                                                                      |    Objects |  Shallow Heap |    Retained Heap
-------------------------------------------------------------------------------------------------------------------------------
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView | 72,237,632 | 1,733,703,168 | >= 1,733,703,168
long[]                                                                          |    480,199 |   818,209,680 |   >= 818,209,680
float[]                                                                         |    480,190 |   410,563,592 |   >= 410,563,592
java.lang.Object[]                                                              |     18,230 |   361,525,488 | >= 2,443,647,088
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray                |    480,189 |    15,366,048 | >= 1,237,456,672
java.util.ArrayList                                                             |     17,811 |       427,464 | >= 2,092,416,104
char[]                                                                          |      2,150 |       272,632 |       >= 272,632
byte[]                                                                          |        141 |        54,048 |        >= 54,048
java.lang.String                                                                |      2,119 |        50,856 |       >= 271,920
java.util.concurrent.ConcurrentHashMap$HashEntry                                |        673 |        21,536 |        >= 38,104
java.net.URL                                                                    |        229 |        14,656 |        >= 40,720
java.util.HashMap$Entry                                                         |        344 |        11,008 |        >= 68,760
-------------------------------------------------------------------------------------------------------------------------------
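
As a quick sanity check on the histogram above, dividing shallow heap by object count gives the size of one instance. The small program below does that for the two Mahout classes in the table. The class and method names are made up for illustration; the figures are copied from the table.

```java
/** Illustrative helper: derives per-instance shallow size from the heap histogram figures. */
public class ShallowSizePerInstance {

    /** Shallow heap divided by object count; meaningful for fixed-size instances, not for arrays. */
    static long bytesPerInstance(long shallowHeapBytes, long objectCount) {
        return shallowHeapBytes / objectCount;
    }

    public static void main(String[] args) {
        // Figures copied from the histogram rows for the two Mahout classes.
        long viewBytes = bytesPerInstance(1_733_703_168L, 72_237_632L);
        long arrayBytes = bytesPerInstance(15_366_048L, 480_189L);

        System.out.println("PreferenceView: " + viewBytes + " bytes per instance");
        System.out.println("GenericUserPreferenceArray: " + arrayBytes + " bytes per instance");

        // Average number of retained view objects per preference array.
        System.out.println("views per array: " + 72_237_632L / 480_189L);
    }
}
```

Both divisions are exact: 24 bytes per PreferenceView and 32 bytes per GenericUserPreferenceArray. So the 1.7G comes from about 150 small view objects being retained per preference array rather than from any single large structure.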


Memory-efficient DataModel, supporting fast online updates and
element-wise iteration
-------------------------------------------------------------------------------------
                 Key: MAHOUT-1286
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
             Project: Mahout
          Issue Type: Improvement
          Components: Collaborative Filtering
    Affects Versions: 0.9
            Reporter: Peng Cheng
            Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

Most DataModel implementations in the current CF component use hash maps to
enable fast 2D indexing and updates. This is not memory-efficient for big
data sets; e.g., the Netflix Prize dataset takes 11G of heap space as a
FileDataModel.
An improved implementation of DataModel should use more compact data
structures (like arrays); this trades a little time complexity in 2D
indexing for a vast improvement in memory efficiency. In addition, any online
recommender or online-to-batch converted recommender will not be affected
by this in the training process.
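
The trade-off described above can be sketched as follows. This is an illustration under assumptions, not the implementation proposed in the issue: CompactPreferenceStore is a hypothetical class, and it assumes user IDs are sorted and each user's item IDs are sorted within that user's slice. A lookup then costs two binary searches instead of a hash lookup, and each preference occupies 12 bytes (one long plus one float) plus a small per-user overhead.

```java
import java.util.Arrays;

/** Hypothetical array-backed preference store; a sketch, not part of Mahout's API. */
public class CompactPreferenceStore {
    private final long[] userIDs;  // sorted ascending, one entry per user
    private final int[] rowStart;  // user u owns slots rowStart[u] .. rowStart[u + 1] - 1
    private final long[] itemIDs;  // sorted ascending within each user's slice
    private final float[] values;  // values[i] is the rating stored for itemIDs[i]

    public CompactPreferenceStore(long[] userIDs, int[] rowStart, long[] itemIDs, float[] values) {
        if (rowStart.length != userIDs.length + 1 || itemIDs.length != values.length) {
            throw new IllegalArgumentException("inconsistent array lengths");
        }
        this.userIDs = userIDs;
        this.rowStart = rowStart;
        this.itemIDs = itemIDs;
        this.values = values;
    }

    /** Slot index of the (user, item) pair, or -1 if absent. Uses two binary searches. */
    private int indexOf(long userID, long itemID) {
        int u = Arrays.binarySearch(userIDs, userID);
        if (u < 0) {
            return -1;
        }
        int i = Arrays.binarySearch(itemIDs, rowStart[u], rowStart[u + 1], itemID);
        return i < 0 ? -1 : i;
    }

    /** Returns the stored value, or NaN when the pair is absent. */
    public float get(long userID, long itemID) {
        int i = indexOf(userID, itemID);
        return i < 0 ? Float.NaN : values[i];
    }

    /** Overwrites an existing value in place; returns false when the pair is absent. */
    public boolean update(long userID, long itemID, float value) {
        int i = indexOf(userID, itemID);
        if (i < 0) {
            return false;
        }
        values[i] = value;
        return true;
    }

    public static void main(String[] args) {
        // Two users: user 1 rated items 10 and 20, user 5 rated item 10.
        CompactPreferenceStore store = new CompactPreferenceStore(
            new long[] {1L, 5L},
            new int[] {0, 2, 3},
            new long[] {10L, 20L, 10L},
            new float[] {3.0f, 4.5f, 1.0f});

        System.out.println(store.get(1L, 20L));
        System.out.println(store.get(5L, 20L));
        System.out.println(store.update(1L, 10L, 2.0f));
    }
}
```

Note that this sketch only updates existing values in place. Adding a new preference would require shifting array contents or keeping a separate buffer of pending insertions, which is left out here.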

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA
administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


