Re: Mahout on Spark?

Sean Owen Wed, 19 Feb 2014 04:53:46 -0800

To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.


You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <[email protected]> wrote:
> +100 for this: different execution engines, like the direction Pig and
> Crunch take
>
> Sent from my iPhone
>
>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <[email protected]> wrote:
>>
>> I imagine Mahout offering users an option to select from different
>> execution engines (just like we currently do by giving M/R or
>> sequential options), starting with Spark. I am not sure what changes are
>> needed in the codebase, though. Maybe following MLI (or the like) and
>> implementing some more stuff, such as common interfaces for iterating over
>> data (the M/R way and the Spark way).
>>
>> IMO, another effort might be porting pre-online machine learning (such as
>> transforming text into vectors based on the dictionary generated
>> beforehand by seq2sparse), machine learning based on mini-batches, and the
>> streaming summarization stuff in Mahout to Spark Streaming.
>>
>> Best,
>> Gokhan
>>
>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> PS I am moving along on a cost optimizer for Spark-backed DRMs on some
>>> multiplicative pipelines that is capable of figuring out different
>>> cost-based rewrites, and on an R-like DSL that mixes in-core and
>>> distributed matrix representations and blocks, but it is painfully slow;
>>> I am really only doing it a couple of nights a month. It does not look
>>> like I will be doing it on company time any time soon (and even if I did,
>>> the company doesn't seem to be inclined to contribute anything new I do on
>>> their time). It is all painfully slow; there's no direct funding for it
>>> anywhere with no strings attached. That will probably be the primary
>>> reason why Mahout would not be able to get much traction compared to
>>> university-based contributions.
>>>
>>>
>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <[email protected]>
>>> wrote:
>>>
>>>> Unfortunately, methinks the prospects of something like a Mahout/MLlib
>>>> merge seem very unlikely due to vastly diverged approaches to the basics
>>>> of linear algebra (and other things). Just like one cannot grow a single
>>>> tree out of two trunks -- not easily, anyway.
>>>>
>>>> It is fairly easy to port (and subsequently beat) MLlib at this point from
>>>> a collection-of-algorithms point of view. But IMO the goal should be more
>>>> MLI-like first, and port second. And be very careful with concepts,
>>>> something that I so far don't see happening with MLlib. MLlib seems to be
>>>> an old-style, Mahout-like rush to become a collection of basic algorithms
>>>> rather than a coherent foundation. Admittedly, I haven't looked very
>>>> closely.
>>>>
>>>>
>>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <[email protected]>
>>>> wrote:
>>>>
>>>>> I'm also convinced that Spark is a superior platform for executing
>>>>> distributed ML algorithms. We had a discussion about a change from
>>>>> Hadoop to another platform some time ago, but at that point in time it
>>>>> was not clear which of the upcoming dataflow processing systems (Spark,
>>>>> Hyracks, Stratosphere) would establish itself among users. To me it
>>>>> seems pretty obvious that Spark has won the race.
>>>>>
>>>>> I concur with Ted, it would be great to have the communities work
>>>>> together. I know that at least 4 Mahout committers (including me) are
>>>>> already following Spark's mailing list and actively participating in the
>>>>> discussions.
>>>>>
>>>>> What are the ideas for how a fruitful cooperation could look?
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> PS:
>>>>>
>>>>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>>>>> to Spark some time ago, but I haven't had time to test my code on a large
>>>>> dataset yet. I'd be happy to see someone help with that.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>>>
>>>>>> I know the Spark/MLlib devs can occasionally be quite set in their ways
>>>>>> of doing certain things, but we'd welcome as many Mahout devs as possible
>>>>>> to work together.
>>>>>>
>>>>>>
>>>>>> It may be too late, but perhaps a GSoC project to look at a port of some
>>>>>> stuff like the co-occurrence recommender and streaming k-means?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> N
>>>>>> --
>>>>>> Sent from Mailbox for iPhone
>>>>>>
>>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <[email protected]> wrote:
>>>>>>>
>>>>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>>>>> overall for ML. If the two communities can work together to leverage
>>>>>>>> the strengths of Spark, and the large amount of good stuff in Mahout
>>>>>>>> (as well as the fantastic depth of experience of Mahout devs) I think
>>>>>>>> a lot can be achieved!
>>>>>>>>
>>>>>>> It makes a lot of sense that Spark would be better than Hadoop for ML
>>>>>>> purposes given that Hadoop was intended to do web-crawl kinds of things
>>>>>>> and Spark was intentionally built to support machine learning.
>>>>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>>>>> I really would prefer it if the two communities (MLlib/MLI and Mahout)
>>>>>>> could work more closely together. There is a lot of good to be had on
>>>>>>> both sides.
>>>
