Re: Mahout 1.0 goals

Ted Dunning Thu, 27 Feb 2014 20:40:25 -0800

Yes.  This is a big and important addition.


On Thu, Feb 27, 2014 at 6:19 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> (5) Another thing I would suggest is to look at feature prep
> standardization -- outlier detection, scaling, hash-tricking, etc.
> Again, with the ability to customize, or it would be useless.
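
As a rough illustration of the kind of customizable feature prep suggested in (5), here is a minimal Python sketch of hash-tricking and scaling. The function names `hash_features` and `standardize`, the bucket count, and the choice of MD5 as the hash are all assumptions made for this example; none of them are Mahout APIs.

```python
import hashlib
import math


def hash_features(tokens, num_buckets=16):
    """Hashing trick: fold arbitrary tokens into a fixed-length vector."""
    vec = [0.0] * num_buckets
    for tok in tokens:
        # MD5 is used only because it is stable across runs (an assumption
        # of this sketch); any well-mixed hash function would do.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        idx = h % num_buckets                   # bucket chosen from the hash
        sign = 1.0 if (h >> 63) == 0 else -1.0  # top bit gives a signed update
        vec[idx] += sign
    return vec


def standardize(column):
    """Scale a numeric column to zero mean and unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0.0:
        return [0.0] * n  # constant column: nothing to scale
    return [(x - mean) / std for x in column]
```

Customization in this setting would mean letting the caller swap the hash, the bucket count, or the scaling rule rather than hard-wiring them.
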
>
>
> On Thu, Feb 27, 2014 at 6:08 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> >
> >
> > If we approach this from a purely "marketing" standpoint, I would look
> > at it from two points: why Mahout is used, and why it is not used.
> >
> > Mahout is not used because it is a collection of methods that are fairly
> > non-uniform in their API, especially the embedded API, and it generally
> > offers zero encouragement to build on top of it or to incorporate it
> > into larger customizable models. I.e. it lacks the semantic explicitness
> > needed for quick prototyping, and stitching things together is next to
> > impossible.
> >
> >
> > Yet Mahout is used in spite of the above because it has some pretty
> > unique solvers in the areas of linear algebra and text topic analysis.
> > But I would dare to say it is not used because of, e.g., GLM
> > regressions.
> >
> > I personally also use Mahout in favor of something like Breeze because
> > it has had sparse linalg support, both in-core and out-of-core, from
> > the very beginning, and it fits naturally, unlike in any other package
> > I have ever looked at, R included, btw.
> >
> > But I find myself heavily disassembling Mahout into nuts and bolts
> > rather than using it exactly how e.g. MIA prescribes.
> >
> > Bottom line here: preliminarily, the primary issues are ease of use,
> > embedding/scripting, ease of customization, and uniformity of APIs.
> >
> > (1) Take the semantic explicitness and scripting issue. Well, I guess
> > that's where the R part comes from, not because we just want to run R.
> > I would clear it up right away -- I don't support any sort of R
> > integration. And not really for lack of trying -- I have created a few
> > R front ends for a bunch of distributed applications, and also created
> > projects that run R in the backend (I wrote CrunchR more than a year
> > ago, which is the same thing for Crunch as what SparkR is for Spark;
> > and yet another MR framework running R in the backend; and also tried
> > to run things with HadoopR). And I have developed a pretty strong
> > opinion that R just doesn't mix with distributed frameworks, mostly
> > because of the performance penalties (if you lose $5 per day in
> > performance on a single machine it may be OK, but on 100 machines one
> > loses $500 a day) -- and mid-size companies in my experience are not
> > susceptible to the "let's solve it at any HW cost" doctrine, much as it
> > is generally believed to be the other way around.
> >
> > Anyway, on the R topic, I don't see it as a solution for any sort of
> > semantically explicit driver and customizer technology. There's neither
> > demand nor willingness among corporate bosses to go that route. I have
> > grown pretty opinionated on that issue.
> >
> > But you don't need R to address semantic explicitness, customization
> > and ease of integration/scripting. Pragmatically, I see Scala and a
> > carefully crafted Scala DSL as the underlying mechanism for achieving
> > this. Also, internally I use Scala scripting a lot, and it is really
> > easy to build a shell interpreter for it (just like Spark builds a
> > customized shell), so one doesn't even necessarily need to compile
> > these things.
> >
> > Bottom line: ideally, a distributed solver implementation should look
> > more like MATLAB than Java. And I would measure that goal along the
> > lines of Evan Sparks' talks (i.e. in lines of code and the explicitness
> > needed to script out a well-known method).
> >
> > See, you forced my hand to discuss solutions ("how")  :)
> >
> >
> > (2) On the issue of minimally supported algorithms. Again, I would not
> > see MLlib as a prototype there. Given enough semantic explicitness,
> > virtually any data scientist would script out ALS in their sleep. And
> > every second one would script out weighted ALS (so-called "implicit
> > feedback"). I view those algorithms not as a goal but rather as a
> > guinea pig for validating the semantic value of an ML environment and
> > its APIs. I would port stronger solvers into the new semantic ML
> > environment over Spark rather than trying to cover the very "basics".
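
To make the "script out ALS" remark in (2) concrete, here is a minimal Python sketch of alternating least squares restricted to rank 1 on a small dense matrix. This is the plain, unweighted special case, not the weighted "implicit feedback" variant mentioned above; the function name `als_rank1` and the parameters `iterations` and `lam` are assumptions of the example, not part of any library.

```python
def als_rank1(ratings, iterations=10, lam=1e-9):
    """Rank-1 alternating least squares on a small dense matrix.

    Approximates ratings[i][j] by u[i] * v[j], alternately solving the
    regularized least-squares problem for u (v fixed) and for v (u fixed).
    """
    n_rows, n_cols = len(ratings), len(ratings[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Fix v: each u[i] has a closed-form ridge-regression solution.
        vv = sum(x * x for x in v) + lam
        u = [sum(ratings[i][j] * v[j] for j in range(n_cols)) / vv
             for i in range(n_rows)]
        # Fix u: each v[j] has the symmetric closed-form solution.
        uu = sum(x * x for x in u) + lam
        v = [sum(ratings[i][j] * u[i] for i in range(n_rows)) / uu
             for j in range(n_cols)]
    return u, v
```

With higher rank, each closed-form division becomes a small linear solve per row or column, which is where a matrix DSL would carry most of the weight.
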
> >
> > Pragmatically, I would say it would be interesting and useful (for me)
> > to have the LDA/LSA/sparse PCA solvers ported. I would also port all
> > the clustering we have (albeit maybe not exactly following the
> > methodology).
> >
> > I would also be interested in laying a foundation for customized
> > hierarchical solutions along the lines of RLFM, with various
> > customizations including in particular temporal weighting of inference
> > and customized inference of informative priors there. Computational
> > Bayesian methods along the lines of MCEM and MCMC are said to provide
> > very accurate solutions here. The latter class of models is IMO much
> > more interesting for practitioners of recommendations than the pure,
> > rigid, uncustomizable ALS class of models, weighted or not. At least
> > Deepak Agarwal sounds very convincing in his talks.
> >
> >
> > (3) On the issue of performance, I guess by using the Spark bindings
> > DSL you can't do any worse than MLlib. Perhaps we could also include
> > support for dense JBlas matrices under the hood of the Matrix API, if
> > there is interest. Also, I am hearing that using GPU libraries is
> > lately becoming very popular for performance reasons; up to 300x linalg
> > speedups are reported. There are some fancy thoughts about cost-based
> > optimization of algebraic expressions for distributed pipelines, but
> > for a first start I will do just very simple physical plan
> > substitutions (something like: if I directly see A'A as a part of an
> > expression, or if the A'B' product has small geometry, then of course
> > I'd rather do (BA)', etc.).
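
The two substitutions named in (3) can be sketched as a tiny rewriter over expression trees. This is a hedged Python illustration only: the tuple encoding, the `ata` node, and the functions `rewrite` and `evaluate` are hypothetical, and the A'B' rule is applied unconditionally here rather than only when the product has small geometry.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


# Expression nodes are plain tuples:
#   ("mat", name), ("t", expr), ("mul", left, right), ("ata", expr)
def rewrite(expr):
    """Apply two simple plan substitutions bottom-up."""
    if expr[0] == "t":
        return ("t", rewrite(expr[1]))
    if expr[0] == "mul":
        left, right = rewrite(expr[1]), rewrite(expr[2])
        # A'A: route to a dedicated self-product operator.
        if left[0] == "t" and left[1] == right:
            return ("ata", right)
        # A'B' -> (BA)': one transpose of the result instead of two inputs.
        if left[0] == "t" and right[0] == "t":
            return ("t", ("mul", right[1], left[1]))
        return ("mul", left, right)
    return expr  # leaves and already-rewritten nodes pass through


def evaluate(expr, env):
    op = expr[0]
    if op == "mat":
        return env[expr[1]]
    if op == "t":
        return transpose(evaluate(expr[1], env))
    if op == "mul":
        return matmul(evaluate(expr[1], env), evaluate(expr[2], env))
    if op == "ata":
        # Stand-in for a specialized A'A implementation.
        a = evaluate(expr[1], env)
        return matmul(transpose(a), a)
    raise ValueError("unknown node: %r" % (op,))
```

Both rules preserve the value of the expression, since (BA)' = A'B'; only the physical execution changes.
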
> >
> > But it has the potential to do more while retaining an absolute degree
> > of manually forced execution (through forced checkpoints). It's just
> > that I would stop at what I pragmatically need to script out
> > distributed SSVD at this point.
> >
> > (4) But in general I would say the scope of your issues sounds like
> > something that would close a gap between 0.5 and 1.0 rather than 0.9 and
> > 1.0.
> > -d
> >
> >
> >
> > On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> >> I would like to start a conversation about where we want Mahout to be
> >> for 1.0.  Let's suspend for the moment the question of how to achieve
> >> the goals.  Instead, let's converge on what we really would like to
> >> have happen, and after that, let's talk about means that will get us
> >> there.
> >>
> >> Here are some goals that I think would be good in the area of numerics,
> >> classifiers and clustering:
> >>
> >> - runs with or without Hadoop
> >>
> >> - runs with or without map-reduce
> >>
> >> - includes (at least) regularized generalized linear models, k-means,
> >> random forest, distributed random forest, distributed neural networks
> >>
> >> - reasonably competitive speed against other implementations including
> >> GraphLab, MLlib and R.
> >>
> >> - interactive model building
> >>
> >> - models can be exported as code or data
> >>
> >> - simple programming model
> >>
> >> - programmable via Java or R
> >>
> >> - runs clustered or not
> >>
> >>
> >> What does everybody think?
> >>
> >
> >
>
