Thanks for starting the conversation, Ted. I'm relatively new to the project, though I've been using Mahout in production for a couple of years, and I'm happy to see things move forward in whatever way makes sense.

I think Mahout needs to ship a production-ready version if it's going to be called 1.0; otherwise we ought to call the next release 0.10. In that vein, I think Sean, Dmitriy, and Ted make good points that Mahout is still a very rough draft. I think we have all used some portion of Mahout in production and been surprised, once we learned how our favorite pieces work and looked around further, to find how dodgy things are in certain spots.

I'd like to see several of the things you mention, Ted, including decoupling from Hadoop and map-reduce where possible, becoming competitive on speed, exporting to PMML, and clarifying the programming approach. And I'm not sure if this is what Dmitriy meant in his comment (3), but I'd love to be able to do Mathematica-style work in an interactive shell and/or symbolic system where I could write A*B' and it would just work. That would crush everything on the market, though it could be a lot of work to build a DSL that supports it.

I also think Dmitriy's (5), on having up-front data assessment tools, is really valuable. I'm building things like that internally at work and I can confirm that there is high demand for it. Along with the up-front pipelining, I'd like to see back in Mahout a feature that I think was in there and got removed: serving results from a web service without writing your own.

So I'd like a free machine-learning library I can count on to make sense when I use the Java/Scala API or command-line programs, take raw data and do the necessary "first whack" at it, prepare vectors for jobs, run jobs, and then build a jar file I can put into Jetty or Tomcat, and, for bonus points, do that "real-time" solr-recommender-style recalculation and results serving. The end-to-end part is where I think Mahout could sprint to the front of the pack and do well.

Best,
Andrew

On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <[email protected]> wrote:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.
> Let's suspend for the moment the question of how to achieve the
> goals. Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
>
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
>
> - runs with or without Hadoop
>
> - runs with or without map-reduce
>
> - includes (at least), regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
>
> - reasonably competitive speed against other implementations including
> graphlab, mlib and R.
>
> - interactive model building
>
> - models can be exported as code or data
>
> - simple programming model
>
> - programmable via Java or R
>
> - runs clustered or not
>
>
> What does everybody think?
>
