[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737267#comment-13737267
 ] 

Gokhan Capan commented on MAHOUT-1286:
--------------------------------------

Peng,

With a SparseRowMatrix, column access (getPreferencesForItem) is slow, but row 
access (getPreferencesFromUser) is pretty fast. I agree with all the other 
problems you mentioned. 
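
To illustrate the row versus column asymmetry, here is a minimal sketch of a 
row-oriented sparse layout. The class and method names are hypothetical and 
this is not Mahout's SparseRowMatrix; it only assumes each user row stores its 
item indices in sorted order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a row-oriented sparse matrix (not Mahout code).
final class SparseRowSketch {
    private final int[][] itemIds;  // itemIds[u]: sorted item indices rated by user u
    private final float[][] values; // values[u][k]: rating for item itemIds[u][k]

    SparseRowSketch(int[][] itemIds, float[][] values) {
        this.itemIds = itemIds;
        this.values = values;
    }

    // Row access: a single array lookup returns all of one user's ratings.
    float[] preferencesFromUser(int user) {
        return values[user];
    }

    // Column access: every row must be searched for the item,
    // so the cost grows with the number of users.
    Map<Integer, Float> preferencesForItem(int item) {
        Map<Integer, Float> byUser = new LinkedHashMap<>();
        for (int user = 0; user < itemIds.length; user++) {
            int k = Arrays.binarySearch(itemIds[user], item);
            if (k >= 0) {
                byUser.put(user, values[user][k]);
            }
        }
        return byUser;
    }
}
```

In this layout a row lookup touches one array, while a column lookup runs a 
binary search in every user's row.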

In Mahout's SVD-based recommenders and FactorizablePreferences, while computing 
top-N recommendations, I believe we compute <activeUser,item> predictions for 
each item and return the top N. So basically, an SVD-based recommender needs 
fast access to the rows of the matrix, but not to the columns (it still needs 
to iterate over item ids, though). Fast column access is only needed in an 
item-based recommender, or if a CandidateItemsStrategy is used.
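
The top-N step described above can be sketched roughly as follows. This is an 
illustration only, not Mahout's implementation: the names are hypothetical, the 
prediction is assumed to be a plain dot product of user and item feature 
vectors, and alreadyRated stands in for the active user's row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of top-N selection for a factorization-based recommender.
final class TopNSketch {

    // Assumed prediction rule: dot product of user and item feature vectors.
    static double predict(double[] userFeatures, double[] itemFeatures) {
        double sum = 0.0;
        for (int f = 0; f < userFeatures.length; f++) {
            sum += userFeatures[f] * itemFeatures[f];
        }
        return sum;
    }

    // Scores every item for one user and returns the n best item indices,
    // best first. Only the active user's row (alreadyRated) is consulted.
    static List<Integer> topN(double[] userFeatures, double[][] itemFeatures,
                              boolean[] alreadyRated, int n) {
        // Min-heap on score, so the weakest kept candidate is evicted first.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[1]));
        for (int item = 0; item < itemFeatures.length; item++) {
            if (alreadyRated[item]) {
                continue;
            }
            heap.add(new double[] {item, predict(userFeatures, itemFeatures[item])});
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add((int) heap.poll()[0]);
        }
        Collections.reverse(result); // heap drains worst-first
        return result;
    }
}
```

The loop iterates over item ids but never reads an item's column of ratings, 
which is why row access is the only access pattern that has to be fast here.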

In my tests on the Netflix data, I saw a 3G heap, too. Let me compare this 
particular approach with the SparseRowMatrix-backed one. I will investigate 
your approach further.

Ted, 

Additionally, I recently implemented a read-only SolrMatrix, which might be 
beneficial while implementing the SolrRecommender, if we want to use the 
existing Mahout library for similarities, etc. I will open a new thread for that.

Best

                
> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1286
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>            Reporter: Peng Cheng
>            Assignee: Sean Owen
>              Labels: collaborative-filtering, datamodel, patch, recommender
>             Fix For: 0.9
>
>         Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
