Hi Sean, Answers inline.
On 04/06/2014 11:35 AM, Sean Owen wrote:
I agree it's worth pausing to ask what is going on. Recent tweets and articles I've seen give the impression that the project is somehow moving entirely to Spark (or even Stratosphere?), or, entirely to H2O. These are sweeping changes that sound very hard to reconcile.
What is going on is the process of finding the next direction for Mahout. This process started only recently, is still ongoing, and involves talking to people and projects outside of Mahout to find areas where collaboration might be beneficial. Apache projects ought to be community driven, and the recent tweets and articles are meant to draw attention and elicit responses from the community regarding the proposed changes, so that we can validate whether we are going in the right direction.
Reactions have been quite positive so far: there is interest in collaboration from the Spark, H2O and Stratosphere communities. There was also a crowded room with no chairs left at the Hadoop Summit Europe last week, when Ted, Suneel and I gave a short talk describing potential future directions for Mahout and had a lively discussion with the audience for the rest of the time.
What is to be done now is to go through a process of discussion and experimentation.
The reality seems more like: someone wants to add some Spark-based matrix stuff and someone else wants to add some H2O-based matrix stuff. These are individually intriguing, and less hard to reconcile, although they sound overlapping.
I think there is a big misconception here. It is not the case that "someone wants to add Spark-based matrix stuff". Dmitriy has been working for several months on a Scala DSL [1] for distributed linear algebraic operations, which allows one to write algorithms in a concise, compact and beautiful way. A first prototype of this code is part of the codebase and looks very promising.
The best aspect of this DSL is that it allows us to define algorithms on a *logical* level using a set of underlying logical operators. The benefit is that this abstracts away the underlying execution system. Dmitriy already provides a prototypical runtime based on Apache Spark. It should be possible to integrate other systems like Stratosphere [2] by simply providing an implementation of the operators tailored to Stratosphere. In this way, users would be given the choice to run our algorithms on different systems without us having to maintain lots of different algorithm implementations.
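To make the idea of logical operators concrete, here is a minimal sketch, written in Python rather than Scala so that it runs without any Mahout or Spark dependency. It is not the DSL from [1] and does not use its API: every name in it (Expr, Leaf, Transpose, MatMul, LocalBackend, RowWiseBackend, gram) is hypothetical. The algorithm only builds a plan out of logical operators, and each backend decides how to execute that plan.

```python
# Hypothetical sketch: none of these names come from Mahout or its Scala DSL.
from dataclasses import dataclass


class Expr:
    """Base class for logical operators: they describe WHAT to compute."""

    def __matmul__(self, other):
        return MatMul(self, other)

    @property
    def T(self):
        return Transpose(self)


@dataclass(frozen=True)
class Leaf(Expr):
    rows: tuple  # matrix data as a tuple of row tuples


@dataclass(frozen=True)
class Transpose(Expr):
    child: Expr


@dataclass(frozen=True)
class MatMul(Expr):
    left: Expr
    right: Expr


class LocalBackend:
    """Executes a plan in memory, operator by operator."""

    def run(self, e):
        if isinstance(e, Leaf):
            return [list(r) for r in e.rows]
        if isinstance(e, Transpose):
            return [list(c) for c in zip(*self.run(e.child))]
        if isinstance(e, MatMul):
            a, b = self.run(e.left), self.run(e.right)
            cols = list(zip(*b))
            return [[sum(x * y for x, y in zip(row, col)) for col in cols]
                    for row in a]
        raise TypeError(f"unknown operator: {e!r}")


class RowWiseBackend(LocalBackend):
    """Stand-in for a row-partitioned engine (assumption, not a real runtime).

    It recognises the logical pattern A' A and computes it as a sum of
    per-row outer products; everything else falls back to LocalBackend.
    """

    def run(self, e):
        if (isinstance(e, MatMul) and isinstance(e.left, Transpose)
                and e.left.child == e.right):
            rows = self.run(e.right)
            n = len(rows[0])
            acc = [[0] * n for _ in range(n)]
            for r in rows:  # each row could live on a different worker
                for i in range(n):
                    for j in range(n):
                        acc[i][j] += r[i] * r[j]
            return acc
        return super().run(e)


def gram(a):
    """The 'algorithm': written once, purely in logical operators."""
    return a.T @ a


A = Leaf(((1, 2), (3, 4), (5, 6)))
plan = gram(A)  # builds MatMul(Transpose(A), A); nothing is computed yet
results = [backend.run(plan) for backend in (LocalBackend(), RowWiseBackend())]
print(results)
```

Both backends receive the same plan object and return the same matrix. Supporting another system would mean adding another backend class, while gram stays untouched.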
But then, it's not clear what happens to the rest of the code base, most of which is not related? Rewriting it seems far out of scope of available effort, and not what anyone is suggesting. I assume deleting it, while coherent, would be too extreme.
This is a point that needs to be discussed. With the latest release, we already deleted over 17,000 lines of code related to rarely used and unmaintained algorithms. Whether it is feasible to port the remaining distributed algorithms to a new platform depends on whether we can attract enough new faces to the project. That is one of the reasons why we talk to other projects and communities. From my personal experience I can say that implementing an algorithm in the new Scala DSL takes only a fraction of the time it takes to write it using MapReduce :)
Speaking as a downstream consumer now, the de facto plan emerging here seems to be a plan to worsen, not address, the significant inconsistencies and problems in the code already. There would be undistributed, MR1, MR2, Spark, H2O code of differing flavors scattered around. It sounds like a step away from 1.0-readiness at a time when this seems to be advertised as coming soon. In the context of a board report, I would think it's also important to acknowledge this perspective, as it is almost certainly causing the project to be removed from a major ecosystem distributor.
What I see is a lively, community-driven discussion ongoing that has yet to produce a de-facto plan. I urge you and the major ecosystem distributor to participate in this discussion so that we can together produce an outcome that matches our interests.
Best,
Sebastian

[1] https://mahout.apache.org/users/sparkbindings/home.html
[2] http://stratosphere.eu/
