[
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944755#comment-13944755
]
Dmitriy Lyubimov commented on MAHOUT-1464:
------------------------------------------
bq. Adding 16 cores to my closet's cluster next week. Is there a 'large'
dataset you have in mind? I have one with 4000 rows, 75,000 columns and 700,000
values but that seems smallish. Can't say when I'll get to it but it's on my
list. If someone can jump in quicker--have at it.
@Sebastian, actually matrix squaring is incredibly expensive -- size ^1.5 for
the flops alone. Did your original version also used matrix squaring? How did
it fare?
Also, since the flops grow power-law w.r.t input size (it is a problem for
ssvd, too) we may need to contemplate a technique that creates finer splits for
such computations based on input size. It very well may be the case that
original hdfs splits may turn out to be too large for adequate load
redistribution.
Technically, it is extremely simple -- we'd just have to insert a physical
operator tweaking RDD splits via "shuffless" coalesce() which also costs
nothing in Spark. However, i am not sure what would be sensible API for this --
automatic, semi-automatic cost-based...
I guess one brainless thing to do is to parameterize drmContext with desired
parallelism (~cluster task capacity) and have optimizer to insert physical
opertors that very # of partitions and do automatic shuffless coalesce if the
number is too low
any thoughts?
> RowSimilarityJob on Spark
> -------------------------
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.9
> Environment: hadoop, spark
> Reporter: Pat Ferrel
> Labels: performance
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>
>
> Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype
> here: https://gist.github.com/sscdotopen/8314254. This should be compatible
> with Mahout Spark DRM DSL so a DRM can be used as input.
> Ideally this would extend to cover MAHOUT-1422 which is a feature request for
> RSJ on two inputs to calculate the similarity of rows of one DRM with those
> of another. This cross-similarity has several applications including
> cross-action recommendations.
--
This message was sent by Atlassian JIRA
(v6.2#6252)