[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944755#comment-13944755 ]
Dmitriy Lyubimov edited comment on MAHOUT-1464 at 3/24/14 5:10 AM:
-------------------------------------------------------------------

bq. Adding 16 cores to my closet's cluster next week. Is there a 'large' dataset you have in mind? I have one with 4000 rows, 75,000 columns and 700,000 values but that seems smallish. Can't say when I'll get to it but it's on my list. If someone can jump in quicker -- have at it.

@Sebastian, actually matrix squaring is incredibly expensive -- size^1.5 for the flops alone. Did your original version also use matrix squaring? How did it fare?

Also, since the flops grow as a power law w.r.t. input size (it is a problem for ssvd, too), we may need to contemplate a technique that creates finer splits for such computations based on input size. It may very well turn out that the original HDFS splits are too large for adequate load redistribution. Technically, it is extremely simple -- we would just have to insert a physical operator tweaking the RDD splits via a "shuffle-less" coalesce(), which also costs nothing in Spark. However, I am not sure what a sensible API for this would be -- automatic, semi-automatic, cost-based... I guess one brainless thing to do is to parameterize the optimizer context with the desired parallelism (~cluster task capacity) and have the optimizer insert physical operators that vary the number of partitions, doing an automatic shuffle-less coalesce if the number is too low. Any thoughts?
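To make the size^1.5 claim concrete, here is a back-of-the-envelope sketch (my own illustration, not code from the ticket): a dense n-by-n input holds n^2 entries, while naively computing A'A takes on the order of n^3 multiply-adds, i.e. (n^2)^1.5 = size^1.5.

```python
# Illustrative flop-count sketch (not Mahout code): cost of naively
# squaring a dense n-by-n matrix versus the size of its input.
def squaring_cost(n):
    size = n * n        # entries in the input matrix
    flops = n ** 3      # multiply-adds for the naive A'A product
    return size, flops

size, flops = squaring_cost(1000)
# flops^2 == size^3 is the integer form of flops == size^1.5
print(size, flops, flops * flops == size ** 3)
```

So a 1000x1000 dense input (10^6 entries) already implies ~10^9 flops, and doubling the input size grows the work by a factor of 2^1.5, which is why the hdfs-derived splits can become too coarse for this operator even when they are fine for everything else.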
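A hypothetical sketch of the shuffle-less coalesce idea (plain Python standing in for Spark's RDD coalesce; the function and the example are my own illustration, not an API from the patch): source partitions are merged locally, in order, into fewer target partitions, so no row is re-hashed or sent across the network.

```python
# Illustrative stand-in (not Spark/Mahout code) for a shuffle-less
# coalesce: adjacent partitions are merged locally, preserving order,
# with no data crossing the network.
def coalesce(partitions, target):
    if target >= len(partitions):
        return partitions           # local merging can only lower the count
    merged = [[] for _ in range(target)]
    for i, part in enumerate(partitions):
        # map each source partition onto one target bucket, in order
        merged[i * target // len(partitions)].extend(part)
    return merged

# e.g. eight fine splits squeezed down toward a task capacity of 3
parts = [[r] for r in range(8)]
print(coalesce(parts, 3))
```

In actual Spark this corresponds to coalesce with the shuffle flag off, which only decreases the partition count; growing the number of partitions instead forces a shuffle, which is part of why the API question (automatic vs. cost-based) is not entirely trivial.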
> RowSimilarityJob on Spark
> -------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>              Labels: performance
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>
>
> Create a version of RowSimilarityJob that runs on Spark. Ssc has a prototype here: https://gist.github.com/sscdotopen/8314254. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input.
> Ideally this would extend to cover MAHOUT-1422, which is a feature request for RSJ on two inputs, to calculate the similarity of rows of one DRM with those of another. This cross-similarity has several applications, including cross-action recommendations.

--
This message was sent by Atlassian JIRA
(v6.2#6252)