[
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941531#comment-13941531
]
Dmitriy Lyubimov commented on MAHOUT-1464:
------------------------------------------
No, I think blockify is fine. It could probably run a bit faster than it does,
but oh well.
And mapblock doesn't trigger it (or rather, it is evaluated lazily, and if the
previous operator already produced blocks, blockify is not used). What I was
saying is along the lines of the A'A computation. There's a structure used to
fuse operators, a sort of "eitherOr" holding either the DrmRdd or the
BlockifiedDrmRdd type. I came to the conclusion that some operators are an
absolute pain to implement on blocks, and others would be a pain to implement
on row vector bags. But blocks can be presented as row bags via viewing, so
conversion to blocks happens only if a subsequent operator requires it. What's
more, a block operator usually outputs blocks as well, and vice versa, so
realistically blockify doesn't happen very often at all.
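
To make the "eitherOr" idea concrete, here is a toy sketch in plain Python. It
is not Mahout code: RowsOrBlocks, map_block and blockify_calls are names made
up for illustration, and nested lists stand in for vectors and matrices. It
only shows the shape of the argument: blocks are built lazily, once, when a
block operator first asks for them, and a chain of block operators never
re-blockifies.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, List[float]]                 # (row key, row vector)
Block = Tuple[List[int], List[List[float]]]   # (row keys, dense rows)


class RowsOrBlocks:
    """Hypothetical holder: starts with exactly one representation set."""

    def __init__(self, rows: Optional[List[Row]] = None,
                 blocks: Optional[List[Block]] = None) -> None:
        assert (rows is None) != (blocks is None)
        self._rows = rows
        self._blocks = blocks
        self.blockify_calls = 0   # counts real row -> block conversions

    def as_blocks(self, block_size: int = 2) -> List[Block]:
        # Convert only if no blocks exist yet, and remember the result.
        if self._blocks is None:
            self.blockify_calls += 1
            rows = self._rows
            self._blocks = [
                ([k for k, _ in rows[i:i + block_size]],
                 [v for _, v in rows[i:i + block_size]])
                for i in range(0, len(rows), block_size)
            ]
        return self._blocks

    def as_rows(self) -> List[Row]:
        if self._rows is not None:
            return self._rows
        # "View" blocks as a row bag: row data is referenced, not copied.
        return [(k, v) for keys, mat in self._blocks
                for k, v in zip(keys, mat)]


def map_block(drm: RowsOrBlocks,
              f: Callable[[List[int], List[List[float]]], Block]) -> RowsOrBlocks:
    # A block operator consumes blocks and produces blocks.
    return RowsOrBlocks(blocks=[f(keys, mat) for keys, mat in drm.as_blocks()])


a = RowsOrBlocks(rows=[(0, [1.0, 2.0]), (1, [3.0, 4.0]), (2, [5.0, 6.0])])
b = map_block(a, lambda keys, mat: (keys, [[x * 2 for x in r] for r in mat]))
c = map_block(b, lambda keys, mat: (keys, [[x + 1 for x in r] for r in mat]))
```

Here only a, which started out as a row bag, pays for a conversion; b and c
are already blockified, so their counters stay at zero.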
Another caveat is that one has to be careful with map blocks that have side
effects on the RDD of origin. Even though Spark says all RDDs are immutable,
side effects will stay visible to parent RDDs if they are cached as MEMORY_ONLY
or MEMORY_AND_DISK (i.e. without the mandatory clone-via-serialization in the
block manager) and then subsequently used as a source again.
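
The caching caveat can be reproduced without Spark. The sketch below is a toy
model, not the Spark block manager: InMemoryCache and SerializedCache are
hypothetical stand-ins for a cache that hands out live object references and
one that stores bytes and deserializes a fresh copy on every read. An in-place
block function leaks into the first and not into the second.

```python
import pickle
from typing import List

Matrix = List[List[float]]


class InMemoryCache:
    """Hypothetical cache that keeps live objects (no cloning on read)."""

    def __init__(self, data: Matrix) -> None:
        self._data = data

    def read(self) -> Matrix:
        return self._data


class SerializedCache:
    """Hypothetical cache that stores bytes; each read yields a fresh copy."""

    def __init__(self, data: Matrix) -> None:
        self._bytes = pickle.dumps(data)

    def read(self) -> Matrix:
        return pickle.loads(self._bytes)


def scale_in_place(block: Matrix, factor: float) -> Matrix:
    # A block function with a side effect: it mutates its input.
    for row in block:
        for j in range(len(row)):
            row[j] *= factor
    return block


mem = InMemoryCache([[1.0, 2.0], [3.0, 4.0]])
scale_in_place(mem.read(), 10.0)
mem_after = mem.read()   # the cached "parent" now holds the scaled values

ser = SerializedCache([[1.0, 2.0], [3.0, 4.0]])
scale_in_place(ser.read(), 10.0)
ser_after = ser.read()   # the cached "parent" is unchanged
```

In this model the safe options are the same as in the comment above: either
have the block function return a new matrix instead of mutating its argument,
or make sure the source is read through a path that clones it.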
> RowSimilarityJob on Spark
> -------------------------
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.9
> Environment: hadoop, spark
> Reporter: Pat Ferrel
> Labels: performance
> Fix For: 0.9
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>
>
> Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype
> here: https://gist.github.com/sscdotopen/8314254. This should be compatible
> with Mahout Spark DRM DSL so a DRM can be used as input.
> Ideally this would extend to cover MAHOUT-1422 which is a feature request for
> RSJ on two inputs to calculate the similarity of rows of one DRM with those
> of another. This cross-similarity has several applications including
> cross-action recommendations.