New name for a new thread. A lot of the discussion on MAHOUT-1464 has been about integrating that feature with the Scala DSL. As Saikat says, this is of general interest, since people seem to agree that the DSL is a good place to integrate efforts.
I’m interested in what I think Dmitriy called data frames. Being a complete noob on Spark I may have gotten this wrong, but let me take a shot so he can correct me.

There are a lot of problems that require a pipeline. The text input pipeline is an example, but almost any input to Mahout requires at least an id translation step. What I thought Dmitriy was suggesting was that by avoiding the disk write + read between steps we might get significant speedups. This has many implications, I’m sure. For one, I think it means the non-serialized objects are being used by multiple parts of the pipeline and so are not subject to “translation”.

Dmitriy, can you explain more? You mentioned a talk you have given; do you have slides somewhere or a PDF?

On Mar 26, 2014, at 7:15 AM, Ted Dunning <[email protected]> wrote:

It would be great to have you.

(go ahead and start new threads when appropriate ... better than hijacking)

On Wed, Mar 26, 2014 at 6:00 AM, Hardik Pandya <[email protected]> wrote:

> Sorry to hijack the thread,
>
> this seems like first steps of mahout getting it to work on spark
>
> there are similar efforts going on with R+Spark aka Spark R
>
> not sure if this helps, played with spark ec2 scripts and it brings up
> multinode cluster using mesos and its configurable - willing to contribute
> donations for mahout-dev
>
> On Sun, Mar 23, 2014 at 11:22 PM, Saikat Kanjilal (JIRA) <[email protected]> wrote:
>
>> [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944710#comment-13944710 ]
>>
>> Saikat Kanjilal commented on MAHOUT-1464:
>> -----------------------------------------
>>
>> +1 on Andrew's suggestion on using AWS to do this. Andrew is it possible
>> to have a shared account so mahout contributors can use this, I'd even be
>> willing to chip in donations :) to have a shared AWS account
>>
>>> RowSimilarityJob on Spark
>>> -------------------------
>>>
>>> Key: MAHOUT-1464
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>>> Project: Mahout
>>> Issue Type: Improvement
>>> Components: Collaborative Filtering
>>> Affects Versions: 0.9
>>> Environment: hadoop, spark
>>> Reporter: Pat Ferrel
>>> Labels: performance
>>> Fix For: 1.0
>>>
>>> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>>>
>>> Create a version of RowSimilarityJob that runs on Spark. Ssc has a
>>> prototype here: https://gist.github.com/sscdotopen/8314254. This should
>>> be compatible with Mahout Spark DRM DSL so a DRM can be used as input.
>>> Ideally this would extend to cover MAHOUT-1422 which is a feature
>>> request for RSJ on two inputs to calculate the similarity of rows of one
>>> DRM with those of another. This cross-similarity has several applications
>>> including cross-action recommendations.
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
