[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028760#comment-14028760
 ] 

Pat Ferrel commented on MAHOUT-1464:
------------------------------------

Ok, learned something today.

As to using the Java-side x.aggregateColumns: it looks like there are distributed 
Spark versions of colSums and the rest. They use Spark accumulators to avoid 
pulling the entire matrix into memory. I followed that model and created 
"colCounts" in MatrixOps and SparkEngine, then used it instead of colSums. 
(colSums only gives the right numbers when the data is boolean; the LLR counts 
need the number of non-zero entries per column, not their summed values.)

Cooccurrence now passes tests with non-boolean data.

It's a bit scary adding to Dmitriy's code, though, so I'll invite him to take a 
look. I added a couple of tests, but I don't see many existing ones for 
SparkEngine.

https://github.com/pferrel/mahout/compare/mahout-1464

Still having problems getting mr-legacy to pass its tests; spark and math-scala 
pass theirs.

> Cooccurrence Analysis on Spark
> ------------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)