IMHO you will always have memory issues if you try to provide constant-time
random access. That's why I proposed creating a special memory-efficient
DataModel for sequential access.
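
To make that concrete, here is a minimal sketch of the kind of layout I have in
mind. It is only an illustration under my own assumptions: CompactPrefs and
Visitor are hypothetical names, not existing Mahout classes, and the sketch
ignores online updates entirely. All preferences live in a few primitive arrays,
sequential iteration creates no per-preference objects, and a random lookup
costs two binary searches instead of a constant-time hash probe.

```java
import java.util.Arrays;

/** Hypothetical sketch, not part of Mahout: preferences packed into primitive arrays. */
final class CompactPrefs {

    /** Hypothetical callback used for sequential, element-wise iteration. */
    interface Visitor {
        void visit(long userId, long itemId, float value);
    }

    private final long[] userIds;  // sorted ascending, one entry per user
    private final int[] offsets;   // user u owns slots offsets[u] .. offsets[u + 1] - 1
    private final long[] itemIds;  // sorted ascending within each user's slice
    private final float[] values;  // values[i] is the rating for itemIds[i]

    CompactPrefs(long[] userIds, int[] offsets, long[] itemIds, float[] values) {
        if (offsets.length != userIds.length + 1 || itemIds.length != values.length) {
            throw new IllegalArgumentException("inconsistent array lengths");
        }
        this.userIds = userIds;
        this.offsets = offsets;
        this.itemIds = itemIds;
        this.values = values;
    }

    /** Sequential access: one linear pass, no object allocated per preference. */
    void forEach(Visitor visitor) {
        for (int u = 0; u < userIds.length; u++) {
            for (int i = offsets[u]; i < offsets[u + 1]; i++) {
                visitor.visit(userIds[u], itemIds[i], values[i]);
            }
        }
    }

    /** Random access: logarithmic rather than constant time; NaN means "no such preference". */
    float get(long userId, long itemId) {
        int u = Arrays.binarySearch(userIds, userId);
        if (u < 0) {
            return Float.NaN;
        }
        int i = Arrays.binarySearch(itemIds, offsets[u], offsets[u + 1], itemId);
        return i < 0 ? Float.NaN : values[i];
    }
}
```

For scale: in the heap histogram quoted below, the 72,237,632 PreferenceView
instances account for 1,733,703,168 bytes of shallow heap, which works out to
24 bytes per object on top of the long[] and float[] data they point into.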


2013/7/23 Peng Cheng (JIRA) <j...@apache.org>

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717659#comment-13717659]
>
> Peng Cheng commented on MAHOUT-1286:
> ------------------------------------
>
> Aye aye, I just did. It turns out that instances of
> PreferenceArray$PreferenceView take up 1.7G. Quite unexpected, right?
> Thanks a lot for the advice.
> My next experiment will use GenericPreference[] directly; there will
> be no more PreferenceArray.
>
> Class Name                                                                      |    Objects |  Shallow Heap |    Retained Heap
> -------------------------------------------------------------------------------------------------------------------------------
> org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView | 72,237,632 | 1,733,703,168 | >= 1,733,703,168
> long[]                                                                          |    480,199 |   818,209,680 | >=   818,209,680
> float[]                                                                         |    480,190 |   410,563,592 | >=   410,563,592
> java.lang.Object[]                                                              |     18,230 |   361,525,488 | >= 2,443,647,088
> org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray                |    480,189 |    15,366,048 | >= 1,237,456,672
> java.util.ArrayList                                                             |     17,811 |       427,464 | >= 2,092,416,104
> char[]                                                                          |      2,150 |       272,632 | >=       272,632
> byte[]                                                                          |        141 |        54,048 | >=        54,048
> java.lang.String                                                                |      2,119 |        50,856 | >=       271,920
> java.util.concurrent.ConcurrentHashMap$HashEntry                                |        673 |        21,536 | >=        38,104
> java.net.URL                                                                    |        229 |        14,656 | >=        40,720
> java.util.HashMap$Entry                                                         |        344 |        11,008 | >=        68,760
> -------------------------------------------------------------------------------------------------------------------------------
>
>
> > Memory-efficient DataModel, supporting fast online updates and element-wise iteration
> > -------------------------------------------------------------------------------------
> >
> >                 Key: MAHOUT-1286
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Collaborative Filtering
> >    Affects Versions: 0.9
> >            Reporter: Peng Cheng
> >            Assignee: Sean Owen
> >   Original Estimate: 336h
> >  Remaining Estimate: 336h
> >
> > Most DataModel implementations in the current CF component use hash maps
> > to enable fast 2D indexing and updates. This is not memory-efficient for
> > big data sets; e.g., the Netflix Prize dataset takes 11G of heap space as
> > a FileDataModel.
> > An improved DataModel implementation should use more compact data
> > structures (like arrays). This trades a little time complexity in 2D
> > indexing for a vast improvement in memory efficiency. In addition, any
> > online recommender or online-to-batch converted recommender will not be
> > affected by this during training.
>
