[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944755#comment-13944755 ]

Dmitriy Lyubimov edited comment on MAHOUT-1464 at 3/24/14 5:10 AM:
-------------------------------------------------------------------

bq. Adding 16 cores to my closet's cluster next week. Is there a 'large' 
dataset you have in mind? I have one with 4000 rows, 75,000 columns and 700,000 
values but that seems smallish. Can't say when I'll get to it but it's on my 
list. If someone can jump in quicker--have at it.

@Sebastian, actually matrix squaring is incredibly expensive -- size^1.5 for 
the flops alone. Did your original version also use matrix squaring? How did 
it fare?
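
Back-of-the-envelope for the dense n-by-n case, just to spell the exponent out:

\mathrm{size} = n^2, \qquad \mathrm{flops}\bigl(A^\top A\bigr) \approx n^2 \cdot n = n^3 = \mathrm{size}^{3/2}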

Also, since the flops grow as a power law w.r.t. input size (it is a problem 
for ssvd, too), we may need to contemplate a technique that creates finer 
splits for such computations based on input size. It may very well turn out 
that the original HDFS splits are too large for adequate load redistribution.

Technically, it is extremely simple -- we'd just have to insert a physical 
operator that tweaks RDD splits via a shuffle-less coalesce(), which also 
costs nothing in Spark. However, I am not sure what a sensible API for this 
would be -- automatic, semi-automatic, cost-based...
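
For reference, the zero-cost call in question is just Spark's narrow coalesce; 
drmRdd below is a stand-in for whatever RDD backs the DRM:

{code}
// Shuffle-less coalesce: a narrow transformation that only regroups existing
// partitions, so nothing moves over the network. Note it can merge partitions
// but cannot split them into finer ones.
val regrouped = drmRdd.coalesce(numPartitions = 32, shuffle = false)
{code}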

I guess one brainless thing to do is to parameterize the optimizer context 
with the desired parallelism (~cluster task capacity) and have the optimizer 
insert physical operators that vary the # of partitions and do an automatic 
shuffle-less coalesce if the number is too low.
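
Roughly something like this (not a real physical operator, just a sketch of 
the heuristic, with desiredParallelism being the knob mentioned above):

{code}
import org.apache.spark.rdd.RDD

// Sketch only: adjust an RDD's partition count toward a desired parallelism
// (~ cluster task capacity) that the optimizer context would carry around.
def rebalance[T](rdd: RDD[T], desiredParallelism: Int): RDD[T] = {
  val current = rdd.partitions.length
  if (current > 4 * desiredParallelism)
    // far more splits than tasks (4x is an arbitrary slack factor):
    // merge them for free via a narrow, shuffle-less coalesce
    rdd.coalesce(desiredParallelism, shuffle = false)
  else if (current < desiredParallelism)
    // too few splits to keep the cluster busy; growing the count does require
    // a shuffle, since a shuffle-less coalesce can only merge partitions
    rdd.coalesce(desiredParallelism, shuffle = true)
  else
    rdd
}
{code}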

any thoughts?



> RowSimilarityJob on Spark
> -------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>              Labels: performance
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>
>
> Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype 
> here: https://gist.github.com/sscdotopen/8314254. This should be compatible 
> with Mahout Spark DRM DSL so a DRM can be used as input. 
> Ideally this would extend to cover MAHOUT-1422 which is a feature request for 
> RSJ on two inputs to calculate the similarity of rows of one DRM with those 
> of another. This cross-similarity has several applications including 
> cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
