I'm having an interesting Twitter conversation with Alan about MAHOUT-106 that is better continued here.

Alan is currently looking at the port of the Pig code and asked why it's so bad to write #users * #items * z values, which I guess refers to my JIRA comment at https://issues.apache.org/jira/browse/MAHOUT-106?focusedCommentId=12872881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12872881

It's bad because the Pig code (and the Java port of it) does this not just for the known entries of the matrix (which would exploit its sparsity) but for *all* possible entries. That won't scale and is IMHO an incorrect interpretation of the algorithm, as Hofmann's paper states that the algorithm's complexity is O(zN), with N being the number of observed ratings.
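To make the difference concrete, here is a minimal sketch of a PLSI E-step that only touches observed (user, item) pairs. It is not the Pig code or the Java port; the function names and the data layout (p_z_given_u[u][z], p_i_given_z[z][i]) are made up for illustration, and it assumes the usual factorization p(i|u) = sum_z p(i|z) p(z|u).

```python
import random


def random_distribution(n, rng):
    """Return n strictly positive weights that sum to 1 (hypothetical helper)."""
    weights = [0.1 + rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def e_step_observed_only(observed, p_z_given_u, p_i_given_z):
    """Compute q(z | u, i) for observed (u, i) pairs only.

    Returns the posteriors and the number of values computed,
    which is z * N for N observed pairs.
    """
    num_z = len(p_i_given_z)
    posteriors = {}
    work = 0
    for (u, i) in observed:
        # Unnormalized posterior over latent classes for this single pair.
        joint = [p_z_given_u[u][z] * p_i_given_z[z][i] for z in range(num_z)]
        total = sum(joint)
        posteriors[(u, i)] = [j / total for j in joint]
        work += num_z
    return posteriors, work


def dense_work(num_users, num_items, num_z):
    """Number of values written when every possible entry is expanded."""
    return num_users * num_items * num_z


if __name__ == "__main__":
    rng = random.Random(42)
    num_users, num_items, num_z = 5, 4, 3
    p_z_given_u = [random_distribution(num_z, rng) for _ in range(num_users)]
    p_i_given_z = [random_distribution(num_items, rng) for _ in range(num_z)]
    observed = [(0, 1), (2, 3), (4, 0)]  # N = 3 observed ratings

    _, sparse = e_step_observed_only(observed, p_z_given_u, p_i_given_z)
    print("observed only:", sparse)                             # z * N = 3 * 3
    print("all entries:", dense_work(num_users, num_items, num_z))  # 5 * 4 * 3
```

With realistic sizes (millions of users and items, but only a few ratings per user) the gap between z * N and #users * #items * z becomes many orders of magnitude.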

Alan also asked for a more commented version of the code (there is none, unfortunately), but I think a lot of the code was written while looking at the description of PLSI in "Google News Personalization: Scalable Online Collaborative Filtering" ( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf )

--sebastian
