Mahout on Spark

Pat Ferrel Wed, 26 Mar 2014 09:05:42 -0700

New name for a new thread.

A lot of the discussion on MAHOUT-1464 has been around integrating that feature 
with the Scala DSL. As Saikat says, this is of general interest, since people 
seem to agree that this is a good place to integrate efforts.


I’m interested in what I think Dmitriy called data frames. Being a complete 
noob on Spark, I may have gotten this wrong, but let me take a shot so he can 
correct me.

There are a lot of problems that require a pipeline. The text input pipeline is 
an example, but almost any input to Mahout requires at least an id translation 
step. What I thought Dmitriy was suggesting was that by avoiding the disk write 
+ read between steps we might get significant speedups. This has many 
implications, I’m sure.

For one, I think it means the non-serialized objects are being used by multiple 
parts of the pipeline and so are not subject to “translation”.
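
To make the pipeline idea concrete, here is a minimal sketch in plain Python. It is not Mahout or Spark code, and every name in it is hypothetical. It runs an id translation step and a stand-in downstream job twice: once with a disk write + read between the steps, and once handing the in-memory objects straight to the next step. That this is the saving Dmitriy had in mind is only an assumption.

```python
import json
import os
import tempfile

# Hypothetical input: (user id, item id) interactions with external string ids.
interactions = [("u1", "iphone"), ("u1", "ipad"), ("u2", "ipad"), ("u2", "nexus")]


def translate_ids(pairs):
    """Id translation step: map external string ids to dense integer indices."""
    row_index, col_index = {}, {}
    translated = []
    for user, item in pairs:
        r = row_index.setdefault(user, len(row_index))
        c = col_index.setdefault(item, len(col_index))
        translated.append((r, c))
    return translated, row_index, col_index


def count_per_column(pairs):
    """Stand-in for a downstream job: count interactions per column index."""
    counts = {}
    for _, c in pairs:
        counts[c] = counts.get(c, 0) + 1
    return counts


def pipeline_via_disk(pairs):
    """Serialize the intermediate result, then read and parse it back."""
    translated, _, col_index = translate_ids(pairs)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "translated.json")
        with open(path, "w") as f:
            json.dump(translated, f)
        with open(path) as f:
            # JSON has no tuples, so the pairs have to be rebuilt after reading.
            reloaded = [tuple(p) for p in json.load(f)]
    return count_per_column(reloaded), col_index


def pipeline_in_memory(pairs):
    """Pass the translated objects directly to the next step."""
    translated, _, col_index = translate_ids(pairs)
    return count_per_column(translated), col_index


if __name__ == "__main__":
    print(pipeline_via_disk(interactions))
    print(pipeline_in_memory(interactions))
```

Both variants produce the same counts; the in-memory one simply skips the serialize, write, read, and parse round trip, and the next step sees the original objects rather than a re-parsed copy.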

Dmitriy, can you explain more? You mentioned a talk you have given; do you have 
slides somewhere or a PDF?


On Mar 26, 2014, at 7:15 AM, Ted Dunning <[email protected]> wrote:

It would be great to have you.


(go ahead and start new threads when appropriate ... better than hijacking)


On Wed, Mar 26, 2014 at 6:00 AM, Hardik Pandya <[email protected]> wrote:

> Sorry to hijack the thread,
> 
> this seems like the first steps of getting mahout to work on spark
> 
> there are similar efforts going on with R+Spark aka Spark R
> 
> not sure if this helps, played with spark ec2 scripts and it brings up
> a multinode cluster using mesos and it's configurable - willing to contribute
> donations for mahout-dev
> 
> 
> On Sun, Mar 23, 2014 at 11:22 PM, Saikat Kanjilal (JIRA) <[email protected]> wrote:
> 
>> 
>>    [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944710#comment-13944710 ]
>> 
>> Saikat Kanjilal commented on MAHOUT-1464:
>> -----------------------------------------
>> 
>> +1 on Andrew's suggestion on using AWS to do this.  Andrew, is it possible
>> to have a shared account so mahout contributors can use this? I'd even be
>> willing to chip in donations :) to have a shared AWS account
>> 
>>> RowSimilarityJob on Spark
>>> -------------------------
>>> 
>>>                Key: MAHOUT-1464
>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>>>            Project: Mahout
>>>         Issue Type: Improvement
>>>         Components: Collaborative Filtering
>>>   Affects Versions: 0.9
>>>        Environment: hadoop, spark
>>>           Reporter: Pat Ferrel
>>>             Labels: performance
>>>            Fix For: 1.0
>>> 
>>>        Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch
>>> 
>>> 
>>> Create a version of RowSimilarityJob that runs on Spark. Ssc has a
>>> prototype here: https://gist.github.com/sscdotopen/8314254. This should
>>> be compatible with Mahout Spark DRM DSL so a DRM can be used as input.
>>> Ideally this would extend to cover MAHOUT-1422 which is a feature
>>> request for RSJ on two inputs to calculate the similarity of rows of one
>>> DRM with those of another. This cross-similarity has several applications
>>> including cross-action recommendations.
>> 
>> 
>> 
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
>> 
>
