[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

Peng Cheng (JIRA) Fri, 16 Aug 2013 16:35:16 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742714#comment-13742714
 ]


Peng Cheng commented on MAHOUT-1286:
------------------------------------

Hi Dr. Dunning,

Much appreciated. I watched your talk in Berlin on YouTube and finally have 
a clue about what is going on here.

If I understand correctly, the core concept is to use Solr as a sparse matrix 
multiplier. So theoretically it can encapsulate any recommendation engine (not 
necessarily CF) if the recommendation phase can be cast as a linear 
multiplication. The co-occurrence matrix is one instance; other types of 
recommendation are possible but slightly harder, and sometimes require 
multiple queries. The following cases should cover most classical CF instances:

1. Item-based CF (result = Sim(A,A) * h, where A is the rating matrix, h is 
the new user's history vector, and Sim(A,A) is the item-to-item similarity 
matrix between all pairs of items): this is the easiest and was already 
addressed in your talk: calculate Sim(A,A) beforehand, import it into Solr, 
and run a query ranked by weighted frequency.

2. User-based CF (result = A^T * Sim(A,h), where Sim(A,h) is the user-to-user 
similarity vector between the new user and all old users): slightly more 
complex. Run the first query on A, ranked by the customized similarity 
function, then use its result to run the second query on A^T, ranked by 
weighted frequency.

3. SVD-based CF: not possible if the new user is not known beforehand. AFAIK 
Solr doesn't have any form of matrix pseudoinversion or optimization function, 
so determining a new user's projection in the SV subspace from its dot 
products with some old items is impossible. However, if the user in question 
is old, or the new user can be merged into the model in real time, Solr can 
just look up its vector in the SV subspace by a full-match search.

4. Ensemble: obviously another linear operation, which can be expressed as a 
query with a mixed ranking function or as multiple queries. Multi-model 
recommendation, as a juxtaposition of rating matrices (A_1 | A_2), was never a 
problem with either old-style CF or recommendation-as-search.
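
To make cases 1 and 2 concrete, here is a minimal sketch in plain Python (no 
Solr involved). It assumes a toy rating matrix stored as nested dicts, uses 
raw co-occurrence counts as a stand-in for Sim(A,A), and uses a plain dot 
product as a stand-in for Sim(A,h). All names and data are made up for 
illustration.

```python
from collections import defaultdict

# Toy rating matrix A: user -> {item: rating}. Illustrative data only.
A = {
    "u1": {"i1": 1.0, "i2": 1.0},
    "u2": {"i2": 1.0, "i3": 1.0},
    "u3": {"i1": 1.0, "i2": 1.0, "i3": 1.0},
}


def cooccurrence(ratings):
    """Item-to-item co-occurrence counts, an assumed stand-in for Sim(A, A)."""
    sim = defaultdict(lambda: defaultdict(float))
    for items in ratings.values():
        for a in items:
            for b in items:
                if a != b:
                    sim[a][b] += 1.0
    return sim


def item_based(sim, h):
    """Case 1: result = Sim(A, A) * h for a sparse history vector h."""
    scores = defaultdict(float)
    for item, weight in h.items():
        for other, s in sim[item].items():
            scores[other] += s * weight
    return dict(scores)


def user_based(ratings, h):
    """Case 2: result = A^T * Sim(A, h), computed as two passes over A."""
    # First pass ("first query"): similarity of the new user to each old user.
    # A dot product is assumed here; any similarity function could be used.
    user_sim = {
        u: sum(r * h.get(i, 0.0) for i, r in items.items())
        for u, items in ratings.items()
    }
    # Second pass ("second query"): weight each old user's ratings by that
    # similarity and sum per item.
    scores = defaultdict(float)
    for u, items in ratings.items():
        for i, r in items.items():
            scores[i] += user_sim[u] * r
    return dict(scores)


h = {"i1": 1.0}  # the new user has interacted with item i1 only
item_scores = item_based(cooccurrence(A), h)
user_scores = user_based(A, h)
```

In both functions only the non-zero entries are touched, which is the same 
work an inverted index does when it scores documents against query terms.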

Judging by the sheer performance and scalability of Solr, this could 
potentially make recommendation-as-search a superior option. However, as 
Gokhan inferred, we will likely still use the old algorithms for training, but 
Solr for recommendation. So I'm going back to 1274 anyway, using the posted 
DataModel as temporary glue. It won't be hard for me or anybody else to 
refactor it for the Solr interface.

-Yours Peng
                
> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1286
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>            Reporter: Peng Cheng
>            Assignee: Sean Owen
>              Labels: collaborative-filtering, datamodel, patch, recommender
>             Fix For: 0.9
>
>         Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementations in the current CF component use hash maps to 
> enable fast 2D indexing and updates. This is not memory-efficient for big 
> data sets; e.g. the Netflix Prize dataset takes 11G of heap space as a 
> FileDataModel.
> An improved DataModel implementation should use more compact data structures 
> (like arrays); this trades a little time complexity in 2D indexing for a 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in its 
> training process.
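
The trade-off described in the issue summary can be sketched roughly as 
follows. This is not the attached InMemoryDataModel.java, whose internals are 
not shown in this message; it is a hypothetical CSR-style layout in Python, 
with the class name CompactDataModel and its methods invented for 
illustration. Preferences live in flat typed arrays sorted by (user, item), so 
a 2D lookup becomes a binary search inside one user's slice instead of a hash 
probe.

```python
from array import array
from bisect import bisect_left


class CompactDataModel:
    """Hypothetical array-backed preference store (CSR-style layout)."""

    def __init__(self, triples):
        # triples: iterable of (user_index, item_index, rating), with users
        # numbered 0..n-1. Sorting groups each user's items in ascending order.
        rows = sorted(triples)
        n_users = rows[-1][0] + 1 if rows else 0
        self.offsets = array("l", [0] * (n_users + 1))
        self.items = array("l", (item for _, item, _ in rows))
        self.values = array("f", (rating for _, _, rating in rows))
        # Count preferences per user, then turn counts into slice boundaries.
        for user, _, _ in rows:
            self.offsets[user + 1] += 1
        for u in range(n_users):
            self.offsets[u + 1] += self.offsets[u]

    def _find(self, user, item):
        # Binary search within the user's slice: O(log k) for k preferences.
        lo, hi = self.offsets[user], self.offsets[user + 1]
        pos = bisect_left(self.items, item, lo, hi)
        return pos if pos < hi and self.items[pos] == item else -1

    def get(self, user, item):
        """Return the stored rating, or None if the pair is absent."""
        pos = self._find(user, item)
        return self.values[pos] if pos >= 0 else None

    def update(self, user, item, rating):
        """Overwrite an existing rating in place; return False if absent."""
        pos = self._find(user, item)
        if pos < 0:
            return False
        self.values[pos] = rating
        return True

    def __iter__(self):
        """Element-wise iteration over (user, item, rating) triples."""
        for user in range(len(self.offsets) - 1):
            for pos in range(self.offsets[user], self.offsets[user + 1]):
                yield user, self.items[pos], self.values[pos]


model = CompactDataModel([(1, 2, 3.0), (0, 1, 5.0), (0, 3, 4.0)])
model.update(0, 1, 2.0)
```

Only updates to existing preferences are handled here; inserting a new 
(user, item) pair would require shifting the arrays, which is one of the costs 
such a layout has to manage.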

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
