To set expectations appropriately, I think it's important to point out that this is completely infeasible short of a total rewrite, and I can't imagine that will happen. If you haven't looked at the code, it may not be obvious how completely dependent on M/R it is.
You can swap out M/R and Spark if you write in terms of something like Crunch, but that is not at all the case here.

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <jayunit...@gmail.com> wrote:
> +100 for this, different execution engines, like the direction pig and
> crunch take
>
> Sent from my iPhone
>
>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <gkhn...@gmail.com> wrote:
>>
>> I imagine in Mahout offering an option to the users to select from
>> different execution engines (just like we currently do by giving M/R or
>> sequential options), and starting from Spark. I am not sure what changes
>> needed in the codebase, though. Maybe following MLI (or alike) and
>> implementing some more stuff, such as common interfaces for iterating over
>> data (the M/R way and the Spark way).
>>
>> IMO, another effort might be porting pre-online machine learning (such
>> transforming text into vector based on the dictionary generated by
>> seq2sparse before), machine learning based on mini-batches, and streaming
>> summarization stuff in Mahout to Spark-Streaming.
>>
>> Best,
>> Gokhan
>>
>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>>> PS I am moving along cost optimizer for spark-backed DRMs on some
>>> multiplicative pipelines that is capable of figuring different cost-based
>>> rewrites and R-Like DSL that mixes in-core and distributed matrix
>>> representations and blocks but it is painfully slow, i really only doing it
>>> like couple nights in a month. It does not look like i will be doing it on
>>> company time any time soon (and even if i did, the company doesn't seem to
>>> be inclined to contribute anything I do anything new on their time). It is
>>> all painfully slow, there's no direct funding for it anywhere with no
>>> string attached. That probably will be primary reason why Mahout would not
>>> be able to get much traction compared to university-based contributions.
>>>
>>>
>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>
>>>> Unfortunately methinks the prospects of something like Mahout/MLLib merge
>>>> seem very unlikely due to vastly diverged approach to the basics of linear
>>>> algebra (and other things). Just like one cannot grow single tree out of
>>>> two trunks -- not easily, anyway.
>>>>
>>>> It is fairly easy to port (and subsequently beat) MLib at this point from
>>>> collection of algorithms point of view. But IMO goal should be more
>>>> MLI-like first, and port second. And be very careful with concepts.
>>>> Something that i so far don't see happening with MLib. MLib seems to be
>>>> old-style Mahout-like rush to become a collection of basic algorithms
>>>> rather than coherent foundation. Admittedly, i havent looked very closely.
>>>>
>>>>
>>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <s...@apache.org> wrote:
>>>>
>>>>> I'm also convinced that Spark is a superior platform for executing
>>>>> distributed ML algorithms. We've had a discussion about a change from
>>>>> Hadoop to another platform some time ago, but at that point in time it was
>>>>> not clear which of the upcoming dataflow processing systems (Spark,
>>>>> Hyracks, Stratosphere) would establish itself amongst the users. To me it
>>>>> seems pretty obvious that Spark made the race.
>>>>>
>>>>> I concur with Ted, it would be great to have the communities work
>>>>> together. I know that at least 4 mahout committers (including me) are
>>>>> already following Spark's mailinglist and actively participating in the
>>>>> discussions.
>>>>>
>>>>> What are the ideas how a fruitful cooperation look like?
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> PS:
>>>>>
>>>>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>>>>> to Spark some time ago, but I haven't had time to test my code on a large
>>>>> dataset yet. I'd be happy to see someone help with that.
>>>>>
>>>>>
>>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>>>
>>>>>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>>>>>> doing certain things, but we'd welcome as many Mahout devs as possible to
>>>>>> work together.
>>>>>>
>>>>>>
>>>>>> It may be too late, but perhaps a GSoC project to look at a port of some
>>>>>> stuff like co occurrence recommender and streaming k-means?
>>>>>>
>>>>>>
>>>>>> N
>>>>>> --
>>>>>> Sent from Mailbox for iPhone
>>>>>>
>>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>>>>
>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>>>>>
>>>>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>>>>> overall for ML. If the two communities can work together to leverage the
>>>>>>>> strengths of Spark, and the large amount of good stuff in Mahout (as well
>>>>>>>> as the fantastic depth of experience of Mahout devs) I think a lot can be
>>>>>>>> achieved!
>>>>>>>
>>>>>>> It makes a lot of sense that Spark would be better than Hadoop for ML
>>>>>>> purposes given that Hadoop was intended to do web-crawl kinds of things
>>>>>>> and Spark was intentionally built to support machine learning.
>>>>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>>>>> I really would prefer it if the two communities (MLib/MLI and Mahout)
>>>>>>> could work more closely together. There is a lot of good to be had on
>>>>>>> both sides.