[
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742714#comment-13742714
]
Peng Cheng commented on MAHOUT-1286:
------------------------------------
Hi Dr. Dunning,
Thank you very much. I watched your talk in Berlin on YouTube and finally have
a clue about what is going on here.
If I understand correctly, the core concept is to use Solr as a sparse matrix
multiplier. So theoretically it can encapsulate any recommendation engine (not
necessarily CF) if the recommendation phase can be cast as a linear
multiplication. The co-occurrence matrix is one instance; other types of
recommendation are possible but slightly harder, and sometimes require multiple
queries. The following cases should cover most classical CF instances:
1. Item-based CF (result = Sim(A,A) * h, where A is the rating matrix, Sim(A,A)
is the item-to-item similarity matrix between all pairs of items, and h is the
new user's preference vector): this is the easiest and has already been
addressed in your talk: calculate Sim(A,A) beforehand, import it into Solr and
run a query ranked by weighted frequency.
2. User-based CF (result = A^T * Sim(A,h), where Sim(A,h) is the user-to-user
similarity vector between the new user and all old users): slightly more
complex. Run the first query on A, ranked by the customized similarity
function, then use its result to run the second query on A^T, ranked by
weighted frequency.
3. SVD-based CF: not possible if the new user is not known beforehand. AFAIK
Solr doesn't have any form of matrix pseudoinversion or optimization function,
so determining a new user's projection in the SV subspace is impossible given
only its dot products with some old items. However, if the user in question is
old, or the new user can be merged into the model in real time, Solr can just
look up its vector in the SV subspace by a full-match search.
4. Ensemble: obviously another linear operation, which can be expressed as a
query with a mixed ranking function or as multiple queries. Multi-model
recommendation, as a juxtaposition of rating matrices (A_1 | A_2), was never a
problem with either old-style CF or recommendation-as-search.
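To make cases 1 and 2 concrete, here is a minimal sketch in plain Python of the
two products, with nested dicts standing in for the sparse index. Everything in
it is an assumption for illustration: the toy data, the function names, the use
of raw co-occurrence counts as Sim(A,A), and the use of a plain dot product as
the user-to-user similarity. It does not call Solr and is not how Solr scores
documents internally.

```python
from collections import defaultdict

# Toy rating matrix A (hypothetical data): rows are users, columns are items.
A = {
    "u1": {"i1": 1.0, "i2": 1.0},
    "u2": {"i2": 1.0, "i3": 1.0},
    "u3": {"i1": 1.0, "i3": 1.0, "i4": 1.0},
}


def cooccurrence(ratings):
    """Item-to-item co-occurrence, i.e. A^T A with the diagonal dropped."""
    sim = defaultdict(dict)
    for prefs in ratings.values():
        for i, ri in prefs.items():
            for j, rj in prefs.items():
                if i != j:
                    sim[i][j] = sim[i].get(j, 0.0) + ri * rj
    return dict(sim)


def item_based(sim, h):
    """Case 1: result = Sim(A,A) * h, one pass over a precomputed matrix."""
    scores = defaultdict(float)
    for item, weight in h.items():
        for other, s in sim.get(item, {}).items():
            scores[other] += s * weight
    return dict(scores)


def user_based(ratings, h):
    """Case 2: result = A^T * Sim(A,h), done as two passes ("queries")."""
    # First pass: similarity between the new user and every old user.
    # A dot product is assumed here; any similarity function could be used.
    user_sim = {
        user: sum(prefs.get(item, 0.0) * w for item, w in h.items())
        for user, prefs in ratings.items()
    }
    # Second pass: sum the rows of A, weighted by those similarities.
    scores = defaultdict(float)
    for user, s in user_sim.items():
        if s == 0.0:
            continue
        for item, r in ratings[user].items():
            scores[item] += s * r
    return dict(scores)


# New user's history vector h: has interacted with item i1 only.
h = {"i1": 1.0}
item_scores = item_based(cooccurrence(A), h)
user_scores = user_based(A, h)
```

In both cases the final step is a weighted sum over sparse rows, which is the
part that maps onto a search ranked by weighted frequency. Note that the
user-based result also scores items the new user already has (i1 here), so a
real system would filter those out afterwards.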
Judging by the sheer performance and scalability of Solr, this could potentially
make recommendation-as-search a superior option. However, as Gokhan inferred, we
will likely still use the old algorithms for training, but Solr for
recommendation. So I'm going back to 1274 anyway, using the posted DataModel as
temporary glue. It won't be hard for me or anybody else to refactor it for the
Solr interface.
-Yours Peng
> Memory-efficient DataModel, supporting fast online updates and element-wise
> iteration
> -------------------------------------------------------------------------------------
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.9
> Reporter: Peng Cheng
> Assignee: Sean Owen
> Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Most DataModel implementations in the current CF component use hash maps to
> enable fast 2D indexing and updates. This is not memory-efficient for big data
> sets, e.g. the Netflix Prize dataset takes 11G of heap space as a FileDataModel.
> An improved implementation of DataModel should use more compact data structures
> (like arrays); this trades a little time complexity in 2D indexing for a
> vast improvement in memory efficiency. In addition, any online recommender or
> online-to-batch converted recommender will not be affected by this in the
> training process.
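
The trade-off described in the issue above can be illustrated with a small
sketch. This is not the attached InMemoryDataModel.java; the class name
CompactPrefs, its methods, and the compressed-row layout are hypothetical, and
online updates are left out. It only shows how packed arrays can replace
per-entry hash map nodes, at the cost of a binary search for 2D lookups, while
keeping element-wise iteration a straight scan.

```python
from array import array
from bisect import bisect_left


class CompactPrefs:
    """Hypothetical compressed-row store for (user, item, value) triples.

    Lookup is O(log k) for a user with k preferences instead of O(1) hashing,
    but the data lives in three flat typed arrays.
    """

    def __init__(self, triples, n_users):
        ordered = sorted(triples)  # sort by (user, item)
        self.row_start = array("l", [0] * (n_users + 1))
        self.items = array("l")
        self.values = array("f")
        for user, item, value in ordered:
            self.row_start[user + 1] += 1  # count entries per user
            self.items.append(item)
            self.values.append(value)
        # Turn per-user counts into cumulative offsets into items/values.
        for user in range(n_users):
            self.row_start[user + 1] += self.row_start[user]

    def get(self, user, item):
        """2D indexing: binary search inside the user's slice."""
        lo, hi = self.row_start[user], self.row_start[user + 1]
        pos = bisect_left(self.items, item, lo, hi)
        if pos < hi and self.items[pos] == item:
            return self.values[pos]
        return None

    def __iter__(self):
        """Element-wise iteration: a linear scan over the packed arrays."""
        for user in range(len(self.row_start) - 1):
            for pos in range(self.row_start[user], self.row_start[user + 1]):
                yield user, self.items[pos], self.values[pos]


# Toy data (hypothetical): user ids are assumed to be dense, starting at 0.
prefs = CompactPrefs([(1, 7, 3.0), (0, 5, 4.0), (0, 2, 5.0)], n_users=2)
```

A training loop that only iterates over all preferences never pays the binary
search cost, which is the sense in which online training is unaffected.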
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira