[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941531#comment-13941531
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
------------------------------------------

No, I think blockify is fine. It could probably run a bit faster than it does, 
but oh well. 

And mapBlock doesn't trigger it (or, rather, it is evaluated lazily, and if the 
previous operator already produced blocks, then blockify is not used). What I 
was saying is along the lines of the A'A computation. There's a structure that 
is used to fuse operators, which is a sort of "eitherOr" of either the DrmRdd 
or the BlockifiedDrmRdd type. I came to the conclusion that there are operators 
that are an absolute pain to implement on blocks, and there are others that 
would be a pain to implement on row vector bags. But blocks can be presented as 
row bags via viewing, so conversion to blocks happens only if the subsequent 
operator requires it. What's more, a block operator usually outputs blocks as 
well, and vice versa, so realistically blockify does not happen very often at 
all.
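
To make the "eitherOr" idea concrete, here is a rough, library-free sketch in 
plain Python. It is not the Mahout code: the names RowsOrBlocks, map_block, 
map_rows and blockify_calls are all made up for illustration, rows are plain 
(key, list) pairs, and evaluation is eager rather than lazy. It only shows 
that the conversion to blocks runs when a block operator is fed row data, and 
is skipped when the previous operator already produced blocks.

```python
class RowsOrBlocks:
    """Hypothetical holder: data is kept either as rows or as blocks."""

    def __init__(self, rows=None, blocks=None):
        # Exactly one representation is supplied by the producing operator.
        assert (rows is None) != (blocks is None)
        self._rows = rows
        self._blocks = blocks
        self.blockify_calls = 0  # counts how often a conversion really ran

    def as_blocks(self, block_size=2):
        # Convert rows to (keys, block) pairs only if blocks are missing.
        if self._blocks is None:
            self.blockify_calls += 1
            rows = self._rows
            self._blocks = [
                ([k for k, _ in rows[i:i + block_size]],
                 [v for _, v in rows[i:i + block_size]])
                for i in range(0, len(rows), block_size)
            ]
        return self._blocks

    def as_rows(self):
        # Blocks are presented as rows through a view (a generator),
        # so no row copy is materialized here.
        if self._rows is not None:
            return iter(self._rows)
        return ((k, row)
                for keys, block in self._blocks
                for k, row in zip(keys, block))


def map_block(data, f):
    """Block operator: consumes blocks and produces blocks."""
    return RowsOrBlocks(blocks=[f(keys, block)
                                for keys, block in data.as_blocks()])


def map_rows(data, f):
    """Row operator: consumes rows and produces rows."""
    return RowsOrBlocks(rows=[(k, f(row)) for k, row in data.as_rows()])


src = RowsOrBlocks(rows=[(0, [1.0, 2.0]), (1, [3.0, 4.0]), (2, [5.0, 6.0])])
doubled = map_block(
    src, lambda keys, block: (keys, [[2 * x for x in row] for row in block]))
plus_one = map_block(
    doubled, lambda keys, block: (keys, [[x + 1 for x in row] for row in block]))
result = list(plus_one.as_rows())
```

In this sketch only the first map_block pays for a conversion; the second one 
reuses the blocks it was handed, and reading the result as rows needs no 
conversion either.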

Another caveat is that one has to be careful with mapBlock operations that have 
side effects on the RDD of origin. Even though Spark says all RDDs are 
immutable, side effects will stay visible to parent RDDs if they are cached as 
MEMORY_ONLY or MEMORY_AND_DISK (i.e. without the mandatory 
clone-via-serialization in the block manager) and are then subsequently used as 
a source again.
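
The effect can be imitated without Spark. The sketch below is only an analogy 
in plain Python, not Spark's block manager: DeserializedCache, 
SerializedCache, scale_in_place and scale_copy are hypothetical names. One 
"cache" hands out the live objects it holds (roughly what a deserialized 
storage level does), the other hands out a fresh copy rebuilt from bytes on 
every read (roughly what a serialized level does).

```python
import pickle


class DeserializedCache:
    """Hypothetical stand-in for a cache that keeps live objects."""

    def __init__(self, blocks):
        self._blocks = blocks

    def read(self):
        # Every reader gets the very same objects.
        return self._blocks


class SerializedCache:
    """Hypothetical stand-in for a cache that keeps serialized bytes."""

    def __init__(self, blocks):
        self._bytes = pickle.dumps(blocks)

    def read(self):
        # Every reader gets a fresh clone rebuilt from the bytes.
        return pickle.loads(self._bytes)


def scale_in_place(block, factor):
    # Block function WITH a side effect: it mutates its input block.
    for row in block:
        for j in range(len(row)):
            row[j] *= factor
    return block


def scale_copy(block, factor):
    # Side-effect-free variant: it builds a new block.
    return [[x * factor for x in row] for row in block]


live_parent = DeserializedCache([[[1.0, 2.0], [3.0, 4.0]]])
child_a = [scale_in_place(b, 10) for b in live_parent.read()]
parent_after_mutation = live_parent.read()   # parent data has changed

cloned_parent = SerializedCache([[[1.0, 2.0], [3.0, 4.0]]])
child_b = [scale_in_place(b, 10) for b in cloned_parent.read()]
parent_after_clone = cloned_parent.read()    # parent data is intact

safe_parent = DeserializedCache([[[1.0, 2.0], [3.0, 4.0]]])
child_c = [scale_copy(b, 10) for b in safe_parent.read()]
parent_after_copy = safe_parent.read()       # parent data is intact
```

In the first case, anything that reads the parent again sees the scaled 
values. Copying the block inside the function, as in scale_copy, avoids that 
regardless of how the parent is cached.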

> RowSimilarityJob on Spark
> -------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>              Labels: performance
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>
>
> Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
> here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
> with Mahout Spark DRM DSL so a DRM can be used as input. 
> Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
> RSJ on two inputs to calculate the similarity of rows of one DRM with those 
> of another. This cross-similarity has several applications including 
> cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
