On Apr 7, 2014 12:30 AM, "Sebastian Schelter" <[email protected]> wrote:
>
> I agree that the state of the MR code is something that needs to be
addressed. There have been several attempts to rework/refactor it, but none
of them had a satisfactory result unfortunately.
>
> I'm hearing that there is a lack of a coherent vision for the future of
Mahout. Let me suggest a radical one.

I agree this sounds a bit radical. I support every item; however, on some
of those, reality will probably stare us very inconveniently in the face if
we really wanted to make this the official direction for the project :)
>
> - call the next release 0.10 not 1.0, as the latter implies a maturity
which does not reflect the radical changes I'm proposing
>
> - move all the MR code to a new maven module, deprecate it and announce
that we delete it in the release after 0.11

I think we should not deprecate the MR stuff as long as it is supported
(and used). Instead, we intentionally keep compatibility with the MR DRM
(not much else here) on persistence. I am not sure how quickly we would be
able to replace all the essential stuff here; it depends on contribution
flow. There's also an opportunity to rethink the standards for
vectorization and feature-prep approaches that exist there -- I don't think
two releases will realistically be enough to get it to a state solidly
verified by production-grade case studies. This will require strong
data-frame semantics, among other things.

Data frames will take some time to develop good semantic concepts for,
imo. Two releases may or may not be a sufficient time frame for this to
happen; that's why I would leave any concrete specifics of MR deprecation
out at this point.

>
> - make the new DSL the heart of Mahout, aim for the following algorithms
to be implemented in the DSL as a new basis:

I am for it, but it implies moving sweepingly to Scala for user code. At
least that's the only thing that would make sense here.

I for one haven't been using Java professionally at all in the past 12
months, and have been using Scala professionally for more than 2 years.
However, I had better be prepared for mobs of pitchfork-bearing villagers
if that's the only alternative left. Maybe some Java translation is still
possible for the logical expressions.
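For anyone who hasn't seen the DSL, here is a rough flavor of the R-like
matrix syntax it aims for. This is a tiny, self-contained in-memory
stand-in, not the actual Mahout code -- the real DSL runs such expressions
over distributed row matrices, and all names here are illustrative only.

```scala
// Illustrative only: a minimal in-memory stand-in for the R-like matrix
// DSL. The real DSL evaluates these expressions over distributed row
// matrices; none of these definitions come from the actual codebase.
case class Matrix(rows: Array[Array[Double]]) {
  // Transpose, written as .t to mimic the R-like syntax.
  def t: Matrix = Matrix(rows.transpose)

  // Matrix multiply, written as %*% to mimic the R-like syntax.
  def %*%(that: Matrix): Matrix = Matrix(
    rows.map { r =>
      that.rows.transpose.map { c =>
        r.zip(c).map { case (x, y) => x * y }.sum
      }
    })
}

// A'A as a one-liner -- the kind of expression ALS or SSVD would build on.
val a = Matrix(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
val ata = a.t %*% a
```

The point is that algorithm code reads like the math, and the question of
where and how it executes is left to whatever sits underneath.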

>
> Collaborative Filtering:
>
>  * Cooccurrence-based recommender (work started in MAHOUT-1464)
>  * ALS (work started in MAHOUT-1365)
>
> Clustering:
>
>  * k-Means
>  * Streaming k-Means
>
> Classification:
>
>  * NaiveBayes (work started in MAHOUT-1493)
>  * either Random Forests or an ensemble of SGD classifiers
>
> Dimensionality Reduction / Topic Models
>
>  * SSVD (prototype in trunk)
>  * PCA (prototype in trunk)
>  * LDA
>
>
Ok, those items are certainly not utopian. I think if I have enough time,
I would even add a couple more methods to that list myself. Actually, part
of the point is that ML specialists will come and throw methods at us. If
not, then I would assume one of the goals of the DSL experiment has failed.

> - integrate Stratosphere / h20 as follows:
>
>  * the Stratosphere guys can choose to implement the physical operators
of the DSL to make our algos run on Stratosphere. If they do, this is great
for Mahout as it allows people to run code on different backends. If they
don't, we don't lose anything.

It'd be cool if we got any concrete handiwork from them here. Big +1.
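To make concrete what "implementing the physical operators" could mean:
below is a purely hypothetical Scala sketch of the logical/physical split,
where algorithms build a logical expression tree once and each backend
supplies its own evaluator for it. None of these names come from the
actual codebase; a real backend would return a running pipeline rather
than a string.

```scala
// Hypothetical sketch of the logical/physical operator split under
// discussion. Algorithms construct a backend-agnostic expression tree;
// a backend (Spark, Stratosphere, ...) implements the evaluation.
sealed trait LogicalOp
case class Load(name: String) extends LogicalOp
case class Transpose(a: LogicalOp) extends LogicalOp
case class Times(a: LogicalOp, b: LogicalOp) extends LogicalOp

trait Backend {
  // A real backend would return a distributed computation, not a String.
  def plan(op: LogicalOp): String
}

// A toy "backend" that just names the physical operators it would run.
object NamingBackend extends Backend {
  def plan(op: LogicalOp): String = op match {
    case Load(n)      => s"load($n)"
    case Transpose(a) => s"transpose(${plan(a)})"
    case Times(a, b)  => s"times(${plan(a)}, ${plan(b)})"
  }
}

// A'A expressed once, runnable on any backend that implements `plan`.
val ata = Times(Transpose(Load("A")), Load("A"))
```

Under this kind of split, the Stratosphere folks would only need to
provide their own `Backend`, and every algorithm written against the
logical layer would come along for free.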
>
>  * a major point in porting the algorithms to the DSL would be to make
the input formats of all algorithms consistent. That would allow h2o to
work off the same inputs as the Scala DSL.

As you know, I am significantly less optimistic here, based on what I know
about the h2o programming model, but I would sure love to see that happen
rather than the current direction of M-1500.

>
> Let me know what you think.
>
> -s
>
>
>
>
>
>
> On 04/06/2014 05:54 PM, Sean Owen wrote:
>>
>> On Sun, Apr 6, 2014 at 4:16 PM, Andrew Musselman
>> <[email protected]> wrote:
>>>
>>> Seems to me there has been a renewed effort to eat our broccoli, along
with
>>> the other ideas people have been bringing on board.
>>>
>>> What are you proposing to put in the board report?
>>
>>
>> I have not seen significant activity to unify or update the existing
>> code. It's still the same different chunks with different styles,
>> input/output, distributed/not, etc. The doc updates look very
>> positive. To be fair the task of really addressing the technical debt
>> is very large, so even making said dent would be a lot of work. A
>> clean-slate reboot therefore actually seems like a good plan, but
>> that's another question...
>>
>> Concretely, in a board report, I personally would not agree with
>> representing the Spark or H2O work as an agreed future plan or
>> roadmap, right now. Being in the board report makes that impression,
>> as have recent articles/tweets I've seen, so it deserves care. That's
>> why I chimed in, maybe tilting at windmills.
>>
>>  From where I sit with customers, the overall impression is negative
>> among those that have tried to use the code, and usage has gone from
>> few to almost none. I doubt my sample is so different from the whole
>> user population. Much of it is consistency/quality, but some of it's
>> just an interest in non-M/R frameworks.
>>
>> So, I think that current state and set of problems is far more
>> important to acknowledge in a board report than just mentioning some
>> future possibilities, and the latter was the impression I got of the
>> likely content. In fact, it makes the talk about large upcoming
>> possible changes make so much more sense.
>>
>
