I agree with b) and c); haven't used seq2sparse enough to grok a).
On Thu, Feb 27, 2014 at 6:30 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

> With the announcement of http://deeplearning4j.org yesterday (various
> Neural Network implementations on Hadoop 2/JBlas, which had been talked
> about in one of the other discussion threads on this mailing list): do we
> wanna duplicate a similar effort in Mahout?
>
> In addition to what Dmitriy's already outlined below, I may add that one
> of the bottlenecks (in my experience) in Mahout's processing pipeline is
> 'seq2sparse'.
>
> a) Optimize seq2sparse to handle incremental dictionary tokens:
>    - support for Deterministic Finite Automata to speed up text processing
>    - not using StringTuples so much in the tokenization (may result in
>      some speedup)
>    - explore using Lucene 4.7 in-memory term dictionaries; this may
>      improve the performance substantially.
>
>    Even better, why not use Lucene indices themselves as document
>    repositories, as opposed to what's being done now?
>
> b) Stabilize the existing clustering algorithms - except for Simple
>    KMeans, the others have issues once we deviate from the 'Happy Sunday
>    Path' implementation, and they lack adequate test coverage.
>
> c) RESTful interfaces for invoking classifiers/clustering.
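I can't speak to the seq2sparse internals, but the Lucene idea in a) seems
cheap to prototype: the term dictionary is already sitting in the index. A
minimal sketch against the Lucene 4.x API (Scala; the "text" field name and
the idea of feeding this into the vectorizer are my assumptions, not
anything that exists in Mahout today):

import java.io.File
import scala.collection.mutable
import org.apache.lucene.index.{DirectoryReader, MultiFields}
import org.apache.lucene.store.FSDirectory

object LuceneDictionary {
  // Build the term -> id dictionary straight off an existing index instead
  // of re-tokenizing the corpus; "text" is a hypothetical field name.
  def fromIndex(indexDir: File, field: String = "text"): mutable.LinkedHashMap[String, Int] = {
    val reader = DirectoryReader.open(FSDirectory.open(indexDir))
    try {
      val dict = mutable.LinkedHashMap[String, Int]()
      val terms = MultiFields.getTerms(reader, field)  // merged view over segments
      if (terms != null) {
        val te = terms.iterator(null)                  // 4.x iterator takes a reuse arg
        var term = te.next()
        while (term != null) {
          dict(term.utf8ToString()) = dict.size        // assign ids in term order
          // te.docFreq() is also available right here for the df-count pass
          term = te.next()
        }
      }
      dict
    } finally reader.close()
  }
}

docFreq()/totalTermFreq() come off the same enum, which, if I read the
pipeline right, covers most of what the dictionary and df-count passes
compute today.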
On Thursday, February 27, 2014 9:10 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> If we approach this from a purely "marketing" standpoint, I would look at
> it from two points: why Mahout is used, and why it is not used.
>
> Mahout is not used because it is a collection of methods that are fairly
> non-uniform in their API, especially the embedded API, and generally has
> zero encouragement to be developed on top of and incorporated into yet
> larger customizable models. I.e., it lacks the semantic explicitness of
> quick prototyping, and stitching things together is next to impossible.
>
> Yet Mahout is used in spite of the above because it has some pretty unique
> solvers in the areas of linear algebra and text topical analysis. But I
> would dare to say not, e.g., because of GLM regressions.
>
> I personally also use Mahout, e.g. in favor of something like Breeze,
> because it has had sparse linalg support, both in-core and out-of-core,
> from the very beginning, and it fits naturally unlike in any other package
> I ever looked at, R included, btw.
>
> But I find myself heavily disassembling Mahout into nuts and bolts rather
> than using it exactly how e.g. MIA prescribes.
>
> Bottom line, the primary issues are ease of use, embedding/scripting, ease
> of customization, and uniformity of APIs.
>
> (1) Take the semantic explicitness and scripting issue. Well, I guess
> that's where the R part comes from -- not because we just want to run R. I
> would clear it up right away: I don't support any sort of R integration.
> And not for lack of trying -- I have created a few R front ends for a
> bunch of distributed applications, and also created projects that run R
> in the backend (I wrote CrunchR more than a year ago, which is the same
> thing for Crunch as what SparkR is for Spark; and yet another MR framework
> running R in the backend; and I also tried to run things with HadoopR).
> And I have developed a pretty strong opinion that R just doesn't mix with
> distributed frameworks, mostly because of the performance penalties (if
> you lose $5 per day in performance on a single machine it may be OK, but
> on 100 machines one loses $500 a day -- and mid-size companies in my
> experience are not susceptible to the 'let's solve it at any HW cost'
> doctrine, much as it is generally believed the other way around).
>
> Anyway, on the R topic, I don't see it as a solution for any sort of
> semantically explicit driver and customizer technology. There's neither
> demand nor willingness of corporate bosses to go that route. I have grown
> pretty opinionated on that issue.
>
> But you don't need R to address semantic explicitness, customization and
> ease of integration/scripting. Pragmatically, I see Scala and a carefully
> crafted Scala DSL as the underlying mechanism for achieving this. Also,
> internally I use Scala scripting a lot, and it is really easy to build a
> shell interpreter for it (just like Spark builds a customized shell), so
> one doesn't even necessarily need to compile these things.
>
> Bottom line, ideally a distributed solver implementation should look more
> like MATLAB than Java. And I would measure that goal along the lines of
> Evan Sparks' talks (i.e., in the lines of code and explicitness needed to
> script out a well-known method).
>
> See, you forced my hand to discuss solutions ("how") :)
>
> (2) On the issue of minimally supported algorithms: again, I would not
> see MLlib as a prototype there. Given enough semantic explicitness,
> virtually any data scientist could script out ALS in their sleep. And
> every second one could script out weighted ALS (so-called "implicit
> feedback"). I view those algorithms not as a goal but rather as a guinea
> pig for validating the semantic value of an ML environment and its APIs.
> I would port stronger solvers into the new semantic ML environment over
> Spark rather than trying to cover the very "basics".
>
> Pragmatically, I would say it would be interesting (for me) to have
> LDA/LSA/sparse PCA solvers ported. I would also port all the clustering
> we have (albeit maybe not exactly following the methodology).
>
> I would also be interested in laying a foundation for customized
> hierarchical solutions along the lines of RLFM, with various
> customizations including in particular temporal weighting of inference
> and customized inference of informative priors there. Computational
> Bayesian methods along the lines of MCEM and MCMC are said to provide
> very accurate solutions here. The latter class of models IMO is much more
> interesting for practitioners of recommendations than the pure, rigid,
> uncustomizable ALS class of models, weighted or not. At least Deepak
> Agarwal sounds very convincing in his talks.
>
> (3) On the issue of performance, I guess by using the Spark bindings DSL
> you can't do any worse than MLlib. Perhaps we could also include support
> for dense JBlas matrices under the hood of the Matrix API, if of
> interest. Also, I am hearing that using GPU libraries has lately become
> very popular for performance reasons; up to 300x linear algebra speedups
> are reported. There are some fancy thoughts about cost-based optimization
> of algebraic expressions for distributed pipelines, but for a first start
> I will do just very simple physical plan substitutions (something like:
> if I directly see A'A as part of an expression, or if an A'B' product has
> small geometry, then of course I'd rather do (BA)', etc.).
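Interjecting on (3): those substitutions look nearly free to express as
pattern matches over a logical-plan tree. A toy sketch of the two rewrites
mentioned above -- all type and node names here are invented for
illustration, not an existing Mahout or Spark API:

// Toy logical plan for distributed matrix expressions.
sealed trait Plan { def nrow: Long; def ncol: Long }
case class Leaf(name: String, nrow: Long, ncol: Long) extends Plan
case class Transpose(a: Plan) extends Plan {
  def nrow = a.ncol; def ncol = a.nrow
}
case class Times(a: Plan, b: Plan) extends Plan {
  require(a.ncol == b.nrow, "geometry mismatch")
  def nrow = a.nrow; def ncol = b.ncol
}
// Fused operator computing A'A in a single pass over A's rows.
case class AtA(a: Plan) extends Plan { def nrow = a.ncol; def ncol = a.ncol }

val smallGeometry = 10L * 1000 * 1000   // arbitrary cutoff, in matrix cells

def rewrite(p: Plan): Plan = p match {
  // A'A appearing literally in the expression => fused single-pass job;
  // case-class structural equality spots the shared subexpression.
  case Times(Transpose(a1), a2) if a1 == a2 => AtA(rewrite(a1))
  // A'B' = (BA)': if the BA product is small, compute it and transpose.
  case Times(Transpose(a), Transpose(b)) if b.nrow * a.ncol <= smallGeometry =>
    Transpose(Times(rewrite(b), rewrite(a)))
  case Times(a, b)  => Times(rewrite(a), rewrite(b))
  case Transpose(a) => Transpose(rewrite(a))
  case other        => other
}

A real optimizer would run this bottom-up to a fixpoint and cost things
properly, but it shows how little code these substitutions need in Scala.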
> But it has the potential to do more while retaining an absolute degree
> of manually forced execution (through forced checkpoints). It's just
> that I would stop at what I pragmatically need to script out distributed
> SSVD at this point.
>
> (4) But in general I would say the scope of your issues sounds like
> something that would close a gap between 0.5 and 1.0 rather than between
> 0.9 and 1.0.
>
> -d
>
> On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > I would like to start a conversation about where we want Mahout to be
> > for 1.0. Let's suspend for the moment the question of how to achieve
> > the goals. Instead, let's converge on what we really would like to
> > have happen, and after that, let's talk about the means that will get
> > us there.
> >
> > Here are some goals that I think would be good in the area of
> > numerics, classifiers and clustering:
> >
> > - runs with or without Hadoop
> > - runs with or without map-reduce
> > - includes (at least) regularized generalized linear models, k-means,
> >   random forest, distributed random forest, distributed neural networks
> > - reasonably competitive speed against other implementations,
> >   including GraphLab, MLlib and R
> > - interactive model building
> > - models can be exported as code or data
> > - simple programming model
> > - programmable via Java or R
> > - runs clustered or not
> >
> > What does everybody think?
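One more thought, tying Dmitriy's (1) and (2) to Ted's "simple programming
model" goal: here is roughly what I would hope "scripting ALS in your
sleep" looks like in a MATLAB-ish Scala DSL. Every name below (DRM, %*%,
collect, inv) is a hypothetical API sketched for discussion, not existing
Mahout code:

// Hypothetical distributed-matrix DSL; only illustrates the target
// level of expressivity.
trait Matrix { def inv: Matrix }      // small k x k in-core dense matrix
trait DRM {                           // "distributed row matrix"
  def t: DRM                          // logical (lazy) transpose
  def %*%(b: DRM): DRM                // distributed matrix product
  def %*%(b: Matrix): DRM             // product with a broadcast in-core RHS
  def collect: Matrix                 // materialize a small result in-core
}

// Plain (unweighted, unregularized) ALS for A ~= U V', written the way one
// writes it on a whiteboard: U := A V (V'V)^-1, V := A' U (U'U)^-1.
def als(drmA: DRM, drmV0: DRM, iterations: Int): (DRM, DRM) = {
  var drmV = drmV0
  var drmU = (drmA %*% drmV) %*% (drmV.t %*% drmV).collect.inv
  for (_ <- 1 to iterations) {
    drmV = (drmA.t %*% drmU) %*% (drmU.t %*% drmU).collect.inv
    drmU = (drmA %*% drmV) %*% (drmV.t %*% drmV).collect.inv
  }
  (drmU, drmV)
}

Note that V'V and U'U are exactly the A'A pattern from (3), so the
physical-plan substitution pays off immediately; regularization and the
"implicit feedback" weighting would be a few more lines on top.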