MAHOUT-106

Sebastian Schelter Thu, 25 Nov 2010 02:42:19 -0800

I'm having an interesting Twitter conversation with Alan about MAHOUT-106 that would be better continued here.

Alan is currently looking at the port of the Pig code and asked why it's so bad to write #users * #items * z values, which I guess refers to my JIRA comment at https://issues.apache.org/jira/browse/MAHOUT-106?focusedCommentId=12872881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12872881

It's bad because in the Pig code (and the Java port of it) this is not done only for the known entries of the matrix (which would exploit its sparsity) but for *all* possible entries. That won't scale and is IMHO an incorrect interpretation of the algorithm, as Hofmann's paper states that the algorithm's complexity is O(zN), with N being the number of observed ratings.
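To make the difference concrete, here is a minimal sketch in Python. It is not the Pig code or its Java port; all names (`e_step_sparse`, `p_z_given_u`, `p_i_given_z`, the toy numbers) are made up for illustration, and it assumes the usual PLSI E-step in which the posterior of a latent class z for a pair (u, i) is proportional to p(z|u) * p(i|z). The point is only that the loop runs over the observed pairs, so the work is z * N rather than #users * #items * z.

```python
def e_step_sparse(observed, p_z_given_u, p_i_given_z, num_z):
    """Compute q(z | u, i) for the observed (user, item) pairs only.

    Work is proportional to num_z * len(observed), i.e. O(zN).
    """
    posteriors = {}
    for (u, i) in observed:
        # Unnormalised posterior for each latent class z.
        weights = [p_z_given_u[u][z] * p_i_given_z[z][i] for z in range(num_z)]
        total = sum(weights)
        posteriors[(u, i)] = [w / total for w in weights]
    return posteriors


def values_written_dense(num_users, num_items, num_z):
    """Number of values produced when every possible (u, i) cell is expanded."""
    return num_users * num_items * num_z


def values_written_sparse(num_observed, num_z):
    """Number of values produced when only observed ratings are expanded."""
    return num_observed * num_z


# Toy data (hypothetical): 2 users, 3 items, 2 latent classes, 3 observed pairs.
observed = [(0, 0), (0, 2), (1, 1)]
p_z_given_u = [[0.5, 0.5], [0.9, 0.1]]            # indexed [user][z]
p_i_given_z = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]  # indexed [z][item]

posteriors = e_step_sparse(observed, p_z_given_u, p_i_given_z, num_z=2)
```

With, say, 1,000 users, 500 items, 10 latent classes and 20,000 observed ratings, `values_written_dense(1000, 500, 10)` is 25 times larger than `values_written_sparse(20000, 10)`, and the gap grows as the matrix gets sparser.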

Alan also asked for a more commented version of the code (there is none, unfortunately), but I think a lot of the code was written looking at the description of PLSI in "Google News Personalization: Scalable Online Collaborative Filtering" (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf)


--sebastian
