I'm having an interesting Twitter conversation with Alan about
MAHOUT-106 that would be better continued here.
Alan is currently looking at the port of the Pig code and asked why it's
so bad to write #users * #items * z values, which I guess refers to my
JIRA comment at
https://issues.apache.org/jira/browse/MAHOUT-106?focusedCommentId=12872881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12872881
It's bad because the Pig code (and the Java port of it) does this not
just for the known entries of the matrix (which would exploit its
sparsity) but for *all* possible entries. That won't scale and is IMHO
an incorrect interpretation of the algorithm, as Hofmann's paper states
that the algorithm's complexity is O(zN), with N being the number of
observed ratings.
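
To make the difference concrete, here is a minimal Python sketch of a
PLSI-style E-step that only touches the observed (user, item) pairs,
next to a count of how many values each approach has to write. This is
not the Pig or Java code from MAHOUT-106; all function and variable
names are made up for illustration, and the model parameters are assumed
to be plain nested lists.

```python
# Hypothetical sketch, not the MAHOUT-106 code: names and data layout
# are assumptions chosen for illustration only.

def e_step_sparse(observed, p_z_given_u, p_i_given_z, num_topics):
    """Compute q(z | u, i) for observed (user, item) pairs only.

    observed     -- list of (user, item) index pairs with a known rating
    p_z_given_u  -- p_z_given_u[u][z], current estimate of p(z | u)
    p_i_given_z  -- p_i_given_z[z][i], current estimate of p(i | z)
    """
    posteriors = {}
    for (u, i) in observed:
        # Unnormalized posterior for each latent class z.
        weights = [p_z_given_u[u][z] * p_i_given_z[z][i]
                   for z in range(num_topics)]
        total = sum(weights)
        posteriors[(u, i)] = [w / total for w in weights]
    return posteriors


def values_written(num_users, num_items, num_topics, observed):
    """Return (dense, sparse) counts of posterior values per E-step."""
    dense = num_users * num_items * num_topics   # all possible entries
    sparse = len(observed) * num_topics          # O(zN), observed only
    return dense, sparse


# Tiny example: 3 users, 4 items, z = 2, only 2 observed ratings.
num_users, num_items, num_topics = 3, 4, 2
observed = [(0, 1), (2, 3)]
p_z_given_u = [[0.5, 0.5] for _ in range(num_users)]
p_i_given_z = [[0.25] * num_items for _ in range(num_topics)]

posteriors = e_step_sparse(observed, p_z_given_u, p_i_given_z, num_topics)
dense, sparse = values_written(num_users, num_items, num_topics, observed)
```

In this toy setup the dense variant writes 24 values per iteration while
the sparse one writes 4. The gap grows with the number of users and
items, since real rating matrices are mostly empty.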
Alan also asked for a more commented version of the code (there is none,
unfortunately), but I think a lot of the code was written by looking at
the description of PLSI in "Google News Personalization: Scalable Online
Collaborative Filtering" (
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf
)
--sebastian
- MAHOUT-106 Sebastian Schelter