I imagine Mahout offering users an option to select among different execution engines (just as we currently offer the M/R or sequential options), starting with Spark. I am not sure what changes would be needed in the codebase, though. Maybe following MLI (or something like it) and implementing some additional pieces, such as common interfaces for iterating over data (the M/R way and the Spark way).
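A rough sketch of what such a common interface could look like, just to make the idea concrete. Every name here (`ExecutionEngine`, `SequentialEngine`, `ParallelEngine`, `Engines.forName`) is hypothetical and exists in neither Mahout, MLI, nor Spark; the parallel-stream engine is only a local stand-in for a distributed backend, not an actual Spark binding.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical common interface: algorithms are written against this,
// not against a concrete backend (M/R, sequential, Spark, ...).
interface ExecutionEngine {
    <T, R> List<R> map(List<T> records, Function<? super T, ? extends R> fn);
}

// Plain in-process, single-threaded execution.
final class SequentialEngine implements ExecutionEngine {
    @Override
    public <T, R> List<R> map(List<T> records, Function<? super T, ? extends R> fn) {
        return records.stream().map(fn).collect(Collectors.toList());
    }
}

// Assumption: a local parallel stream stands in for a distributed engine.
// A real Spark-backed engine would wrap an RDD instead of a List.
final class ParallelEngine implements ExecutionEngine {
    @Override
    public <T, R> List<R> map(List<T> records, Function<? super T, ? extends R> fn) {
        return records.parallelStream().map(fn).collect(Collectors.toList());
    }
}

// Hypothetical selector, analogous to choosing M/R vs. sequential by option.
final class Engines {
    private Engines() {}

    static ExecutionEngine forName(String name) {
        switch (name) {
            case "sequential": return new SequentialEngine();
            case "parallel":   return new ParallelEngine();
            default: throw new IllegalArgumentException("Unknown engine: " + name);
        }
    }
}

final class EngineDemo {
    public static void main(String[] args) {
        // The engine name would come from a user-supplied option.
        ExecutionEngine engine = Engines.forName(args.length > 0 ? args[0] : "sequential");
        List<Integer> squares = engine.map(Arrays.asList(1, 2, 3), x -> x * x);
        System.out.println(squares);
    }
}
```

The point is only that the algorithm code (the `x -> x * x` part) stays the same regardless of which engine the user picks.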
IMO, another effort might be porting pre-online machine learning (such as transforming text into vectors based on the dictionary generated beforehand by seq2sparse), machine learning based on mini-batches, and the streaming summarization stuff in Mahout to Spark Streaming.

Best,
Gokhan

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> PS I am moving along on a cost optimizer for Spark-backed DRMs on some
> multiplicative pipelines that is capable of figuring out different
> cost-based rewrites, and an R-like DSL that mixes in-core and distributed
> matrix representations and blocks, but it is painfully slow; I am really
> only doing it a couple of nights a month. It does not look like I will be
> doing it on company time any time soon (and even if I did, the company
> doesn't seem to be inclined to contribute anything new I do on their
> time). It is all painfully slow; there's no direct funding for it anywhere
> with no strings attached. That will probably be the primary reason why
> Mahout would not be able to get much traction compared to university-based
> contributions.
>
> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> > Unfortunately, methinks the prospects of something like a Mahout/MLlib
> > merge seem very unlikely due to vastly diverged approaches to the basics
> > of linear algebra (and other things). Just like one cannot grow a single
> > tree out of two trunks -- not easily, anyway.
> >
> > It is fairly easy to port (and subsequently beat) MLlib at this point
> > from a collection-of-algorithms point of view. But IMO the goal should be
> > more MLI-like first, and port second. And be very careful with concepts.
> > Something that I so far don't see happening with MLlib. MLlib seems to be
> > an old-style Mahout-like rush to become a collection of basic algorithms
> > rather than a coherent foundation. Admittedly, I haven't looked very
> > closely.
> >
> > On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <s...@apache.org> wrote:
> >
> >> I'm also convinced that Spark is a superior platform for executing
> >> distributed ML algorithms. We've had a discussion about a change from
> >> Hadoop to another platform some time ago, but at that point in time it
> >> was not clear which of the upcoming dataflow processing systems (Spark,
> >> Hyracks, Stratosphere) would establish itself amongst the users. To me
> >> it seems pretty obvious that Spark won the race.
> >>
> >> I concur with Ted, it would be great to have the communities work
> >> together. I know that at least 4 Mahout committers (including me) are
> >> already following Spark's mailing list and actively participating in
> >> the discussions.
> >>
> >> What are the ideas for how a fruitful cooperation could look?
> >>
> >> Best,
> >> Sebastian
> >>
> >> PS:
> >>
> >> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
> >> to Spark some time ago, but I haven't had time to test my code on a
> >> large dataset yet. I'd be happy to see someone help with that.
> >>
> >> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
> >>
> >>> I know the Spark/MLlib devs can occasionally be quite set in their
> >>> ways of doing certain things, but we'd welcome as many Mahout devs as
> >>> possible to work together.
> >>>
> >>> It may be too late, but perhaps a GSoC project to look at a port of
> >>> some stuff like the co-occurrence recommender and streaming k-means?
> >>>
> >>> N
> >>> --
> >>> Sent from Mailbox for iPhone
> >>>
> >>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>
> >>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> >>>>
> >>>>> My (admittedly heavily biased) view is Spark is a superior platform
> >>>>> overall for ML. If the two communities can work together to leverage
> >>>>> the strengths of Spark, and the large amount of good stuff in Mahout
> >>>>> (as well as the fantastic depth of experience of Mahout devs) I
> >>>>> think a lot can be achieved!
> >>>>
> >>>> It makes a lot of sense that Spark would be better than Hadoop for ML
> >>>> purposes given that Hadoop was intended to do web-crawl kinds of
> >>>> things and Spark was intentionally built to support machine learning.
> >>>>
> >>>> Given that Spark has been announced by a majority of the Hadoop-based
> >>>> distribution vendors, it makes sense that maybe Mahout should jump in.
> >>>>
> >>>> I really would prefer it if the two communities (MLlib/MLI and Mahout)
> >>>> could work more closely together. There is a lot of good to be had on
> >>>> both sides.