Re: Mahout on Spark

Dmitriy Lyubimov Wed, 26 Mar 2014 10:12:56 -0700

Sure.

@Saikat et al:


Check out the "Wanted" section at
http://mahout.apache.org/users/sparkbindings/home.html

Of course, data frames and standardization of vectorization (feature prep)
are a very high priority there.
Another high priority is an interactive shell / scripting (just like the
Spark shell), something very similar in spirit to R's interactive/script
runner mode. It is very important.

Re: data frames. Anyone familiar with R knows what they are: basically a set
of named columnar vectors (with rows named or enumerated as well), plus a set
of filtering/modifying DSL expressions similar to R's (I haven't really
thought about it in depth). The tricky part here is in-core data frame
support, of course, since data frames are based on vectors that go beyond
just the real (double) values we have right now. In R, vector values can be
of integral, boolean and character (i.e. string) types as well. If we had
in-core support for that (or borrowed it from somewhere), the rest would be
easy -- it is just a matter of semantic elegance. Really, I suggest looking
at the R paradigms there; R has a pretty elegant way of working with
closures.
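
To make the data frame idea above a bit more concrete, here is a minimal
sketch, written in plain Python purely for illustration. It is not Mahout or
Spark code: the DataFrame class, its row() and filter() methods, and the
sample column names are all hypothetical. It only shows named columnar
vectors of mixed types (integral, double, boolean, string), named rows, and
an R-like filter driven by a closure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# Column values are not limited to doubles (hypothetical type alias).
Value = Union[float, int, bool, str]


@dataclass
class DataFrame:
    """Hypothetical in-core data frame: named columns plus named rows."""
    columns: Dict[str, List[Value]]  # column name -> column vector
    row_names: List[str]             # one name per row

    def __post_init__(self) -> None:
        # Every column must have exactly one value per row.
        n = len(self.row_names)
        for name, col in self.columns.items():
            if len(col) != n:
                raise ValueError(
                    f"column {name!r} has {len(col)} values, expected {n}")

    def row(self, i: int) -> Dict[str, Value]:
        # View of row i as a mapping from column name to value.
        return {name: col[i] for name, col in self.columns.items()}

    def filter(self, pred: Callable[[Dict[str, Value]], bool]) -> "DataFrame":
        # Keep only the rows for which the closure returns True.
        keep = [i for i in range(len(self.row_names)) if pred(self.row(i))]
        return DataFrame(
            columns={name: [col[i] for i in keep]
                     for name, col in self.columns.items()},
            row_names=[self.row_names[i] for i in keep],
        )


# Made-up sample data with one column of each value type.
df = DataFrame(
    columns={
        "age": [34, 27, 45],
        "score": [0.5, 0.9, 0.7],
        "active": [True, False, True],
        "city": ["Oslo", "Lima", "Pune"],
    },
    row_names=["u1", "u2", "u3"],
)

# The filtering expression is an ordinary closure over a row.
selected = df.filter(lambda r: r["active"] and r["age"] > 30)
```

The sketch keeps each column as a separate vector, so a real implementation
could swap the Python lists for typed in-core vectors without changing the
filtering interface.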

Of course we could use off-the-shelf stuff such as Maps to support
something named, with string values. I don't know at this point. Scala
itself goes a long way to help out here.

As for the slides, they are of little interest in themselves since they
mostly re-interpret and summarize the working notes PDF in a somewhat more
palatable way. It is just an opportunity to deliver some content to folks who
shy away from reading docs for some reason *wink wink*. I will put them on
the site after the meetup if that is OK.




On Wed, Mar 26, 2014 at 9:09 AM, Saikat Kanjilal <[email protected]> wrote:

> +1, in fact I would be very much indebted if someone (namely Dmitry :) )
> could do a google hangout focused on spark where folks can ask questions
> and learn more, to this end I want to bring up something else, it'd be
> great if mahout itself either through the apache project foundation or
> through committer means have a hadoop cluster to test algorithms, it seems
> like folks have their own cluster to test on but I think it'd be a benefit
> to the community to have a cluster that everyone can leverage.
>
> > Subject: Mahout on Spark
> > From: [email protected]
> > Date: Wed, 26 Mar 2014 09:05:02 -0700
> > To: [email protected]; [email protected]
>
> >
> > New name for a new thread.
> >
> > A lot of the discussion on MAHOUT-1464 has been around integrating that
> feature with the Scala DSL. As Saikat says this is of general interest
> since people seem to agree that this is a good place to integrate efforts.
> >
> > I'm interested in what I think Dmitriy called data frames. Being a
> complete noob on Spark I may have gotten this wrong but let me take a shot
> so he can correct me.
> >
> > There are a lot of problems that require a pipeline. The text input
> pipeline is an example, but almost any input to Mahout requires at least an
> id translation step. What I thought Dmitriy was suggesting was that by
> avoiding the disk write + read between steps we might get significant
> speedups. This has many implications, I'm sure.
> >
> > For one I think it means the non-serialized objects are being used by
> multiple parts of the pipeline and so are not subject to "translation".
> >
> > Dmitriy can you explain more? You mentioned a talk you have given, do
> you have slides somewhere or a PDF?
> >
> >
> > On Mar 26, 2014, at 7:15 AM, Ted Dunning <[email protected]> wrote:
> >
> > It would be great to have you.
> >
> >
> > (go ahead and start new threads when appropriate ... better than
> hijacking)
> >
> >
> > On Wed, Mar 26, 2014 at 6:00 AM, Hardik Pandya <[email protected]
> >wrote:
> >
> > > Sorry to hijack the thread,
> > >
> > > this seems like first steps of mahout getting it to work on spark
> > >
> > > there are similar efforts going on with R+Spark aka Spark R
> > >
> > > not sure if this helps, played with spark ec2 scripts and it brings up
> > > multinode cluster using mesos and its configurable - willing to
> contribute
> > > donations for mahout-dev
> > >
> > >
> > >
> > >
> > >
> > > On Sun, Mar 23, 2014 at 11:22 PM, Saikat Kanjilal (JIRA) <
> [email protected]
> > >> wrote:
> > >
> > >>
> > >> [
> > >>
> > >
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944710#comment-13944710
> > > ]
> > >>
> > >> Saikat Kanjilal commented on MAHOUT-1464:
> > >> -----------------------------------------
> > >>
> > >> +1 on Andrew's suggestion on using AWS to do this. Andrew is it
> possible
> > >> to have a shared account so mahout contributors can use this, I 'd
> even
> > > be
> > >> willing to chip in donations :) to have a shared AWS account
> > >>
> > >>> RowSimilarityJob on Spark
> > >>> -------------------------
> > >>>
> > >>> Key: MAHOUT-1464
> > >>> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> > >>> Project: Mahout
> > >>> Issue Type: Improvement
> > >>> Components: Collaborative Filtering
> > >>> Affects Versions: 0.9
> > >>> Environment: hadoop, spark
> > >>> Reporter: Pat Ferrel
> > >>> Labels: performance
> > >>> Fix For: 1.0
> > >>>
> > >>> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
> > >> MAHOUT-1464.patch
> > >>>
> > >>>
> > >>> Create a version of RowSimilarityJob that runs on Spark. Ssc has a
> > >> prototype here: https://gist.github.com/sscdotopen/8314254. This
> should
> > >> be compatible with Mahout Spark DRM DSL so a DRM can be used as input.
> > >>> Ideally this would extend to cover MAHOUT-1422 which is a feature
> > >> request for RSJ on two inputs to calculate the similarity of rows of
> one
> > >> DRM with those of another. This cross-similarity has several
> applications
> > >> including cross-action recommendations.
> > >>
> > >>
> > >>
> > >> --
> > >> This message was sent by Atlassian JIRA
> > >> (v6.2#6252)
> > >>
> > >
> >
>
