[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737267#comment-13737267 ]
Gokhan Capan commented on MAHOUT-1286:
--------------------------------------

Peng,

With a SparseRowMatrix, column access (getPreferencesForItem) is slow, but row access (getPreferencesFromUser) is pretty fast. I agree with all the other problems you mentioned.

In Mahout's SVD-based recommenders and FactorizablePreferences, while computing top-N recommendations, I believe we compute <activeUser,item> predictions for each item and return the top N. So basically, an SVD-based recommender needs fast access to the rows of the matrix, but not the columns (it still needs to iterate over item ids, though). Fast column access is only needed in an item-based recommender, or if a CandidateItemsStrategy is used.

In my tests on the Netflix data, I saw a 3G heap, too. Let me compare this particular approach with the SparseRowMatrix-backed one. I will investigate your approach further.

Ted,

Additionally, I recently implemented a read-only SolrMatrix, which might be beneficial while implementing the SolrRecommender, if we want to use the existing Mahout library for similarities etc. I will open a new thread for that.

Best

> Memory-efficient DataModel, supporting fast online updates and element-wise iteration
> -------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1286
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>            Reporter: Peng Cheng
>            Assignee: Sean Owen
>              Labels: collaborative-filtering, datamodel, patch, recommender
>             Fix For: 0.9
>
>         Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementations in the current CF component use hash maps to enable
> fast 2D indexing and updates. This is not memory-efficient for big data sets,
> e.g. the Netflix prize dataset takes 11G of heap space as a FileDataModel.
> An improved implementation of DataModel should use more compact data structures
> (like arrays); this trades a little time complexity in 2D indexing for a
> vast improvement in memory efficiency. In addition, any online recommender or
> online-to-batch converted recommender will not be affected by this in the
> training process.
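
As a rough illustration of the row-versus-column point in the comment above, here is a minimal Java sketch. It is not Mahout code: the class RowAccessSketch and its methods ratingsOfUser, usersOfItem and topN are hypothetical names invented for this example, and it assumes the item indices in each row are sorted. It stores ratings row by row in plain arrays, so reading one user's ratings is a direct lookup while finding the users of one item has to scan every row. The topN method scores every item for one user from factor vectors and keeps the best n, which only needs that user's row of factors; it does not filter out items the user has already rated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical row-major rating store; an illustration, not Mahout's SparseRowMatrix. */
final class RowAccessSketch {
    // One pair of parallel arrays per user (row). Item indices are assumed sorted.
    private final int[][] itemIndices;
    private final float[][] values;

    RowAccessSketch(int[][] itemIndices, float[][] values) {
        this.itemIndices = itemIndices;
        this.values = values;
    }

    /** Row access: a single array lookup, independent of the number of users. */
    float[] ratingsOfUser(int user) {
        return values[user];
    }

    /** Column access: every row must be searched, so the cost grows with the user count. */
    List<Integer> usersOfItem(int item) {
        List<Integer> users = new ArrayList<>();
        for (int user = 0; user < itemIndices.length; user++) {
            if (Arrays.binarySearch(itemIndices[user], item) >= 0) {
                users.add(user);
            }
        }
        return users;
    }

    /** Scores every item for one user by dot product and returns the n best, highest first. */
    static int[] topN(double[] userFactors, double[][] itemFactors, int n) {
        double[] scores = new double[itemFactors.length];
        for (int i = 0; i < itemFactors.length; i++) {
            double dot = 0.0;
            for (int k = 0; k < userFactors.length; k++) {
                dot += userFactors[k] * itemFactors[i][k];
            }
            scores[i] = dot;
        }
        // Min-heap on score: the weakest of the current candidates is evicted first.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Integer i) -> scores[i]));
        for (int i = 0; i < scores.length; i++) {
            heap.add(i);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        // Polling yields ascending scores, so fill the result from the back.
        int[] result = new int[heap.size()];
        for (int pos = result.length - 1; pos >= 0; pos--) {
            result[pos] = heap.poll();
        }
        return result;
    }

    public static void main(String[] args) {
        RowAccessSketch m = new RowAccessSketch(
            new int[][] {{0, 2}, {1, 2}, {0}},
            new float[][] {{5f, 3f}, {4f, 2f}, {1f}});
        System.out.println("ratings of user 1: " + Arrays.toString(m.ratingsOfUser(1)));
        System.out.println("users who rated item 2: " + m.usersOfItem(2));
        double[][] itemFactors = {{1, 0}, {0, 1}, {2, 2}, {0.5, 0.5}};
        System.out.println("top-2 items: "
            + Arrays.toString(topN(new double[] {1.0, 0.5}, itemFactors, 2)));
    }
}
```

The parallel int/float arrays are also one way to picture the "more compact data structures (like arrays)" mentioned in the issue description, though the attached InMemoryDataModel.java may be organized quite differently.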