[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944773#comment-13944773
 ] 

Sebastian Schelter commented on MAHOUT-1464:
--------------------------------------------

[~pferrel] I planned to test the implementation on the R2 dataset from Yahoo 
Webscope [1], which has 700M interactions. I regularly use that dataset in the 
papers I write. You have to do a little paperwork with Yahoo to get access, 
however.

[~dlyubimov] The original implementation also uses matrix squaring. The input 
matrix is row-partitioned; mappers compute the outer products of the rows and 
emit the resulting matrices row-wise, and reducers sum those up. The 
downsampling that is applied beforehand helps to cut down the cost of that 
multiplication.
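
To illustrate the scheme described above, here is a minimal single-process 
sketch in Python. It is not the Mahout code: the function names (map_row, 
reduce_rows, gram_matrix) and the dict-based sparse row format are 
assumptions made only for this example. Each "mapper" call turns one input 
row into the rows of its outer product, and the "reducer" sums the partial 
rows that share a key, which yields A^T A.

```python
from collections import defaultdict


def map_row(row):
    """Hypothetical mapper: emit the outer product of one sparse row, row-wise.

    `row` is a dict {column index: value}. For every non-zero entry a_i we
    emit the key i together with the partial result row {j: a_i * a_j}.
    """
    for i, a_i in row.items():
        yield i, {j: a_i * a_j for j, a_j in row.items()}


def reduce_rows(emitted):
    """Hypothetical reducer: sum all partial rows emitted under the same key."""
    result = defaultdict(lambda: defaultdict(float))
    for i, partial in emitted:
        for j, value in partial.items():
            result[i][j] += value
    return {i: dict(r) for i, r in result.items()}


def gram_matrix(rows):
    """Compute A^T A as the sum of the outer products of the rows of A."""
    emitted = (pair for row in rows for pair in map_row(row))
    return reduce_rows(emitted)


# Toy 3 x 3 interaction matrix, one dict per row (made-up data).
rows = [
    {0: 1.0, 1: 1.0},
    {1: 1.0, 2: 1.0},
    {0: 1.0, 1: 1.0, 2: 1.0},
]
ata = gram_matrix(rows)
```

A row with k non-zero entries produces k * k products in map_row, so a few 
very dense rows dominate the cost. Capping the number of interactions per 
row before this step is what keeps the multiplication affordable.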


[1] http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

> RowSimilarityJob on Spark
> -------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>              Labels: performance
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>
>
> Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
> here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
> with Mahout Spark DRM DSL so a DRM can be used as input. 
> Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
> RSJ on two inputs to calculate the similarity of rows of one DRM with those 
> of another. This cross-similarity has several applications including 
> cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)