If we approach this from a purely "marketing" standpoint, I would look at it from two angles: why Mahout is used, and why it is not used.
Mahout is not used because it is a collection of methods that are fairly non-uniform in their APIs, especially the embedded APIs, and generally offers zero encouragement to be built on top of and incorporated into larger customizable models. I.e., it lacks the semantic explicitness needed for quick prototyping, and stitching things together is next to impossible. Yet Mahout is used in spite of the above because it has some pretty unique solvers in the areas of linear algebra and topical text analysis -- but, I would dare say, not because of, e.g., the GLM regressions. I personally also use Mahout in favor of something like Breeze because it has had sparse linear algebra support, both in-core and out-of-core, from the very beginning, and it fits in naturally unlike any other package I have ever looked at, R included btw. But I find myself heavily disassembling Mahout into nuts and bolts rather than using it exactly the way, e.g., MIA prescribes. Bottom line here: the preliminary primary issues are ease of use, embedding/scripting, ease of customization, and uniformity of APIs.

(1) Take the semantic explicitness and scripting issue. I guess that's where the R part comes from, not because we just want to run R. Let me make it clear right away -- I don't support any sort of R integration. And not really for lack of trying -- I have created a few R front ends for a bunch of distributed applications, and also created projects that run R in the backend (I wrote CrunchR more than a year ago, which is to Crunch what SparkR is to Spark; also yet another MR framework running R in the backend; and I tried to run things with HadoopR as well). And I have developed a pretty strong opinion that R just doesn't mix with distributed frameworks, mostly because of the performance penalties: if you lose $5 per day in performance on a single machine it may be OK, but on 100 machines one loses $500 a day, and mid-size companies in my experience are not susceptible to the "let's solve it at any HW cost" doctrine, much as it is generally believed to be the other way around. Anyway, on the R topic, I don't see it as a solution for any sort of semantically explicit driver and customizer technology. There's neither demand nor willingness among corporate bosses to go that route. I have grown pretty opinionated on that issue.

But you don't need R to address semantic explicitness, customization, and ease of integration/scripting. Pragmatically, I see Scala, and a carefully crafted Scala DSL, as the underlying mechanism for achieving this. Also, internally I use Scala scripting a lot, and it is really easy to build a shell interpreter for it (just like Spark builds a customized shell), so one doesn't even necessarily need to compile these things. Bottom line: ideally a distributed solver implementation should look more like Matlab than Java, and I would measure that goal along the lines of Evan Sparks' talks (i.e., in the lines of code and explicitness needed to script out a well-known method). See, you forced my hand to discuss solutions ("how") :)

(2) On the issue of minimally supported algorithms. Again, I would not take MLlib as the prototype there. Given enough semantic explicitness, virtually any data scientist could script out ALS in their sleep, and every second one could script out weighted ALS (so-called "implicit feedback"); a rough sketch of what I mean follows below. I view those algorithms not as a goal but rather as a guinea pig for validating the semantic value of an ML environment and its APIs. I would port stronger solvers into the new semantic ML environment over Spark rather than trying to cover the very "basics".
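To make the "more like Matlab than Java" point concrete, here is roughly the kind of ALS scripting I have in mind. This is an illustrative sketch only: the distributed matrix type (Drm) and the operators (%*%, .t, solve, diag, drmRandGaussian) assume a hypothetical Scala algebra DSL and are not an existing or committed API.

    // Illustrative sketch only -- assumes a hypothetical distributed-algebra Scala
    // DSL: a distributed row matrix type Drm, transpose (.t), matrix product (%*%),
    // a small in-core inverse solve(), a scaled identity diag(lambda, k), and a
    // Gaussian random initializer drmRandGaussian(). None of these are a real API.
    //
    // Plain (unweighted) regularized ALS for A ~ U %*% V.t with k latent factors.
    def als(drmA: Drm, k: Int, lambda: Double, iterations: Int): (Drm, Drm) = {
      var drmU = drmRandGaussian(drmA.nrow, k)
      var drmV = drmRandGaussian(drmA.ncol, k)
      for (_ <- 1 to iterations) {
        // V update: V = A' U (U'U + lambda I)^-1; U'U is a small k x k in-core matrix.
        drmV = drmA.t %*% drmU %*% solve(drmU.t %*% drmU + diag(lambda, k))
        // U update, symmetrically: U = A V (V'V + lambda I)^-1.
        drmU = drmA %*% drmV %*% solve(drmV.t %*% drmV + diag(lambda, k))
      }
      (drmU, drmV)
    }

Weighted ALS ("implicit feedback") differs mainly in that each row gets its own confidence-weighted k x k system instead of the single shared one above; the scripting overhead stays on roughly this scale, which is exactly the point.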
Pragmatically, I would say it would be interesting (for me) to have the LDA/LSA/sparse PCA solvers ported. I would also port all the clustering we have (albeit maybe not exactly following the same methodology). I would also be interested in laying a foundation for customized hierarchical solutions along the lines of RLFM, with various customizations including, in particular, temporal weighting of inference and customized inference of informative priors. Computational Bayesian methods along the lines of MCEM and MCMC are said to provide very accurate solutions here. That class of models is, IMO, much more interesting for practitioners of recommendations than the pure, rigid, uncustomizable ALS class of models, weighted or not. At least Deepak Agarwal sounds very convincing in his talks.

(3) On the issue of performance, I guess by using the Spark bindings DSL you can't do any worse than MLlib. Perhaps we could also include support for dense JBlas matrices under the hood of the Matrix API if there is interest. I also hear that using GPU libraries is lately becoming very popular for performance reasons; up to 300x linear algebra speedups are reported. There are some fancier thoughts about cost-based optimization of algebraic expressions for distributed pipelines, but for a first cut I would do just very simple physical plan substitutions: e.g., if I directly see A'A as part of an expression, or if an A'B' product has small geometry, then of course I'd rather do (BA)', etc. (a toy sketch of what I mean is appended after the quoted message below). It has the potential to do more while retaining an absolute degree of manually forced execution (through forced checkpoints). It's just that I would stop at what I pragmatically need to script out distributed SSVD at this point.

(4) But in general I would say the scope of your issues sounds like something that would close a gap between 0.5 and 1.0 rather than 0.9 and 1.0.

-d

On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> I would like to start a conversation about where we want Mahout to be for
> 1.0. Let's suspend for the moment the question of how to achieve the
> goals. Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
>
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
>
> - runs with or without Hadoop
>
> - runs with or without map-reduce
>
> - includes (at least), regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
>
> - reasonably competitive speed against other implementations including
> graphlab, mlib and R.
>
> - interactive model building
>
> - models can be exported as code or data
>
> - simple programming model
>
> - programmable via Java or R
>
> - runs clustered or not
>
>
> What does everybody think?
>
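P.S. Appending here, below the quoted message, the toy sketch of the physical plan substitutions mentioned in point (3). It is written as a plain Scala pattern match over a made-up expression tree; the node names (Mat, Transpose, Product, AtA) are mine, for illustration only, and not an existing Mahout or Spark-bindings API.

    // Toy sketch only: the expression-tree nodes below are made up for illustration.
    sealed trait Expr
    case class Mat(name: String, nrow: Long, ncol: Long) extends Expr
    case class Transpose(a: Expr) extends Expr
    case class Product(a: Expr, b: Expr) extends Expr
    // Hypothetical dedicated physical operator for A' %*% A, computed in one pass over A.
    case class AtA(a: Expr) extends Expr

    def rewrite(e: Expr): Expr = e match {
      // A' %*% A  ->  single-pass self-product operator.
      case Product(Transpose(a), b) if a == b => AtA(rewrite(a))
      // A' %*% B'  ->  (B %*% A)'; in practice a geometry/cost check would guard this rule.
      case Product(Transpose(a), Transpose(b)) => Transpose(Product(rewrite(b), rewrite(a)))
      case Product(a, b)                       => Product(rewrite(a), rewrite(b))
      case Transpose(a)                        => Transpose(rewrite(a))
      case leaf                                => leaf
    }

For example, with a leaf a = Mat("A", 1000000, 100), rewrite(Product(Transpose(a), a)) collapses into AtA(a), so the plan never materializes A' explicitly.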