Re: Mahout on Spark?

Gokhan Capan Wed, 19 Feb 2014 02:21:13 -0800

I imagine Mahout offering users an option to select from different
execution engines (just as we currently do with the M/R and sequential
options), starting with Spark. I am not sure what changes are needed in the
codebase, though. Maybe following MLI (or something similar) and
implementing some more pieces, such as common interfaces for iterating over
data (the M/R way and the Spark way).
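
To make the engine-selection idea a bit more concrete, here is a rough
sketch in Python (chosen only for brevity). Nothing in it is Mahout, MLI or
Spark API: ExecutionEngine, SequentialEngine, PartitionedEngine, ENGINES and
column_sums are all hypothetical names, and PartitionedEngine merely
simulates a distributed backend in-process.

```python
from abc import ABC, abstractmethod
from functools import reduce


class ExecutionEngine(ABC):
    """Hypothetical engine-neutral interface (not a Mahout or Spark API)."""

    @abstractmethod
    def map_reduce(self, data, mapper, reducer):
        """Apply mapper to every item, then fold the results with reducer."""


class SequentialEngine(ExecutionEngine):
    """Runs everything in a single pass on one machine."""

    def map_reduce(self, data, mapper, reducer):
        return reduce(reducer, map(mapper, data))


class PartitionedEngine(ExecutionEngine):
    """Stand-in for a distributed backend: fold each partition, then merge."""

    def __init__(self, num_partitions=2):
        self.num_partitions = num_partitions

    def map_reduce(self, data, mapper, reducer):
        items = list(data)
        # Round-robin split; a real backend would decide partitioning itself.
        parts = [items[i::self.num_partitions] for i in range(self.num_partitions)]
        partials = [reduce(reducer, map(mapper, part)) for part in parts if part]
        return reduce(reducer, partials)


# The user picks an engine by name, much like choosing M/R or sequential today.
ENGINES = {"sequential": SequentialEngine, "partitioned": PartitionedEngine}


def column_sums(rows, engine_name="sequential"):
    """Algorithm written once against the interface, independent of the engine."""
    engine = ENGINES[engine_name]()
    add = lambda a, b: [x + y for x, y in zip(a, b)]
    return engine.map_reduce(rows, list, add)
```

The sketch assumes a non-empty input and a reducer that is associative and
commutative, since the partitioned engine combines items in a different
order than the sequential one.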


IMO, another effort might be porting the preprocessing for online machine
learning (such as transforming text into vectors based on the dictionary
previously generated by seq2sparse), machine learning based on mini-batches,
and the streaming summarization stuff in Mahout to Spark Streaming.
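
As a rough illustration of the first two pieces, here is a small Python
sketch. It is not Mahout or Spark Streaming code: vectorize and mini_batches
are hypothetical helpers, the dictionary is a plain term-to-index dict
standing in for the one seq2sparse writes, and whitespace tokenization is an
assumption made to keep the example short.

```python
from collections import Counter
from itertools import islice


def vectorize(text, dictionary):
    """Turn text into a sparse {index: count} vector using a fixed dictionary.

    Terms missing from the dictionary are dropped, because the dictionary
    was built earlier and is not updated while streaming.
    """
    tokens = text.lower().split()  # assumed tokenizer: lowercase + whitespace
    counts = Counter(tok for tok in tokens if tok in dictionary)
    return {dictionary[tok]: float(n) for tok, n in counts.items()}


def mini_batches(stream, batch_size):
    """Group an arbitrarily long stream into lists of at most batch_size items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def vectorized_batches(texts, dictionary, batch_size):
    """Yield mini-batches of sparse vectors, ready for an incremental learner."""
    vectors = (vectorize(text, dictionary) for text in texts)
    return mini_batches(vectors, batch_size)
```

Each batch produced by vectorized_batches could then be handed to whatever
incremental learner is in use; that part is left out here on purpose.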

Best,
Gokhan

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> PS I am moving along on a cost optimizer for Spark-backed DRMs on some
> multiplicative pipelines. It is capable of figuring out different cost-based
> rewrites, together with an R-like DSL that mixes in-core and distributed
> matrix representations and blocks. But it is painfully slow going; I am
> really only doing it a couple of nights a month. It does not look like I
> will be doing it on company time any time soon (and even if I did, the
> company doesn't seem to be inclined to contribute anything new I do on
> their time). There is no direct funding for it anywhere with no strings
> attached. That will probably be the primary reason why Mahout would not be
> able to get much traction compared to university-based contributions.
>
>
> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> > Unfortunately, methinks the prospects of something like a Mahout/MLlib
> > merge seem very unlikely due to vastly diverged approaches to the basics
> > of linear algebra (and other things). Just like one cannot grow a single
> > tree out of two trunks -- not easily, anyway.
> >
> > It is fairly easy to port (and subsequently beat) MLlib at this point
> > from a collection-of-algorithms point of view. But IMO the goal should be
> > to become more MLI-like first, and port second. And to be very careful
> > with concepts, something that I so far don't see happening with MLlib.
> > MLlib seems to be an old-style Mahout-like rush to become a collection of
> > basic algorithms rather than a coherent foundation. Admittedly, I haven't
> > looked very closely.
> >
> >
> > On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <s...@apache.org> wrote:
> >
> >> I'm also convinced that Spark is a superior platform for executing
> >> distributed ML algorithms. We had a discussion about a change from
> >> Hadoop to another platform some time ago, but at that point in time it
> >> was not clear which of the upcoming dataflow processing systems (Spark,
> >> Hyracks, Stratosphere) would establish itself amongst the users. To me
> >> it seems pretty obvious that Spark won the race.
> >>
> >> I concur with Ted, it would be great to have the communities work
> >> together. I know that at least 4 Mahout committers (including me) are
> >> already following Spark's mailing list and actively participating in the
> >> discussions.
> >>
> >> What are the ideas for what a fruitful cooperation could look like?
> >>
> >> Best,
> >> Sebastian
> >>
> >> PS:
> >>
> >> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
> >> to Spark some time ago, but I haven't had time to test my code on a
> >> large dataset yet. I'd be happy to see someone help with that.
> >>
> >> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
> >>
> >>> I know the Spark/MLlib devs can occasionally be quite set in their ways
> >>> of doing certain things, but we'd welcome as many Mahout devs as
> >>> possible to work together.
> >>>
> >>>
> >>> It may be too late, but perhaps a GSoC project to look at a port of
> >>> some stuff like the cooccurrence recommender and streaming k-means?
> >>>
> >>>
> >>>
> >>>
> >>> N
> >>>
> >>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>
> >>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> >>>>
> >>>>> My (admittedly heavily biased) view is Spark is a superior platform
> >>>>> overall for ML. If the two communities can work together to leverage
> >>>>> the strengths of Spark, and the large amount of good stuff in Mahout
> >>>>> (as well as the fantastic depth of experience of Mahout devs) I think
> >>>>> a lot can be achieved!
> >>>>>
> >>>> It makes a lot of sense that Spark would be better than Hadoop for ML
> >>>> purposes given that Hadoop was intended to do web-crawl kinds of things
> >>>> and Spark was intentionally built to support machine learning.
> >>>> Given that Spark has been announced by a majority of the Hadoop-based
> >>>> distribution vendors, it makes sense that maybe Mahout should jump in.
> >>>> I really would prefer it if the two communities (MLlib/MLI and Mahout)
> >>>> could work more closely together.  There is a lot of good to be had on
> >>>> both sides.
> >>>>
> >>>
> >>
> >
>
