[ Please allow me to introduce myself to everyone, I'm Tom Kraljevic with
0xdata. ]
Hi Pat,
You raise some good questions here, please let me add to the discussion.
> If Mahout moves to another faster, better execution engine, it will do so
> only once in the immediate future.
This conversation has definitely highlighted the potential improvements that
the next generation of compute platforms can bring to the Mahout community
(users as well as developers).
Clearly, moving to an in-memory approach (vs. Hadoop's filesystem-centric
mapreduce) enables new paths. One interesting challenge, in my mind, is how to
take that on in a way that keeps many options open for the Mahout community
instead of locking in to one. People have expressed interest in at least three
different next-gen platforms here (H2O, Spark, Stratosphere), and the question
I would ask is whether it is possible to take advantage of all of them. At a
high level, I would think that Mahout users (and machine learners in general)
would be well served by having more options.
So I think a really interesting API question/challenge for the Mahout community
is: does it make sense to think about an API that considers in-memory
primitives as an abstraction enabling different engines to be plugged in?
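To make the idea concrete, here is a rough sketch of what such a pluggable
abstraction could look like. All of the names here (DistributedEngine,
DrmLike, LocalEngine) are invented for illustration; this is not actual Mahout
or H2O code, just one possible shape for an engine-neutral API.

```java
// Hypothetical sketch of an engine-agnostic distributed matrix API.
// Algorithm code sees only the interfaces; each backend (H2O, Spark, ...)
// would plug in by implementing DistributedEngine.
public class EngineSketch {
    // Minimal logical matrix handle: the user program sees only this.
    interface DrmLike {
        int nrow();
        int ncol();
    }

    // The plug-in point: one implementation per execution engine.
    interface DistributedEngine {
        DrmLike parallelize(double[][] data);
        DrmLike transpose(DrmLike a);
        double[][] collect(DrmLike a);
    }

    // A trivial in-process engine, standing in for a real backend.
    static class LocalEngine implements DistributedEngine {
        static class LocalDrm implements DrmLike {
            final double[][] m;
            LocalDrm(double[][] m) { this.m = m; }
            public int nrow() { return m.length; }
            public int ncol() { return m[0].length; }
        }
        public DrmLike parallelize(double[][] data) { return new LocalDrm(data); }
        public DrmLike transpose(DrmLike a) {
            double[][] m = ((LocalDrm) a).m;
            double[][] t = new double[m[0].length][m.length];
            for (int i = 0; i < m.length; i++)
                for (int j = 0; j < m[0].length; j++)
                    t[j][i] = m[i][j];
            return new LocalDrm(t);
        }
        public double[][] collect(DrmLike a) { return ((LocalDrm) a).m; }
    }

    // Algorithm code written only against the abstraction, not any engine.
    static double[][] transposeWith(DistributedEngine engine, double[][] data) {
        return engine.collect(engine.transpose(engine.parallelize(data)));
    }

    public static void main(String[] args) {
        double[][] t = transposeWith(new LocalEngine(),
                new double[][]{{1, 2, 3}, {4, 5, 6}});
        System.out.println(t.length + "x" + t[0].length); // 3x2
    }
}
```

The point is only that algorithm code written against the interfaces never
names a backend, so swapping engines is a one-line change at the call site.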
> 0xdata is trying to build a faster BD analytics platform (OLAP), not sparse
> data machine learning in daily production
H2O is definitely intended for machine learning in daily production, both at
the platform level (in-memory scaling across a cluster) and in the algorithms
we have focused on (GLM, GBM, Random Forest, etc.). We have also invested in
autogenerating lightweight, high-speed scoring models for embedding in
real-time environments (to distinguish between an in-production daily modeling
flow and an in-production traffic-facing prediction engine).
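To illustrate the distinction: the modeling flow runs on the cluster, while
the traffic-facing side only needs a small, dependency-free scoring function.
A minimal sketch of that idea follows; the class name and coefficients are
invented for illustration, not actual H2O-generated code.

```java
// Hedged illustration of a "lightweight embedded scorer": coefficients are
// fit offline by the modeling cluster, then emitted as plain code like this
// for the real-time prediction path. Values here are made up.
public class EmbeddedScorerSketch {
    // Pretend these were learned offline by a GLM (logistic regression).
    static final double INTERCEPT = -1.5;
    static final double[] COEF = {0.8, -0.2, 1.1};

    // Pure function, no cluster or I/O: cheap enough for a request path.
    static double score(double[] features) {
        double z = INTERCEPT;
        for (int i = 0; i < COEF.length; i++) z += COEF[i] * features[i];
        return 1.0 / (1.0 + Math.exp(-z));   // predicted probability
    }

    public static void main(String[] args) {
        System.out.printf("%.4f%n", score(new double[]{1.0, 2.0, 0.5}));
    }
}
```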
Regarding the question about sparse, let me cut-and-paste part of Cliff's
earlier email.
  * A note on sparse data: H2O sports about 15 different compression
    schemes under the hood, including ones designed to compress sparse
    data. We happily import SVMLight without the data ever "blowing up",
    while still fully supporting the array-access API, including speed
    guarantees.
This is actually our second crack at it, and we've put on our roadmap a third
crack to tackle massively sparse datasets.
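The core idea Cliff is describing can be sketched in a few lines: store only
the non-zeros per row (as an SVMLight-style parser produces them) while still
exposing a dense-style accessor where absent columns read as zero. The names
below are made up and this assumes sorted indices per line; it is an
illustration of the technique, not H2O's actual representation.

```java
// Illustrative sparse row: index/value pairs instead of a dense array,
// with a get(col) accessor so callers can treat it like a full row.
public class SparseSketch {
    static class SparseRow {
        final int[] idx;      // sorted column indices of non-zeros
        final double[] val;   // corresponding values
        SparseRow(int[] idx, double[] val) { this.idx = idx; this.val = val; }
        double get(int col) {
            int p = java.util.Arrays.binarySearch(idx, col);
            return p >= 0 ? val[p] : 0.0;  // absent columns read as zero
        }
    }

    // Parse the feature part of one SVMLight-style line,
    // e.g. "1 3:0.5 101:2.0" (label, then index:value pairs).
    static SparseRow parseSvmLightFeatures(String line) {
        String[] tok = line.trim().split("\\s+");
        int n = tok.length - 1;                 // first token is the label
        int[] idx = new int[n];
        double[] val = new double[n];
        for (int i = 0; i < n; i++) {
            String[] kv = tok[i + 1].split(":");
            idx[i] = Integer.parseInt(kv[0]);
            val[i] = Double.parseDouble(kv[1]);
        }
        return new SparseRow(idx, val);
    }

    public static void main(String[] args) {
        SparseRow r = parseSvmLightFeatures("1 3:0.5 101:2.0");
        // Storage stays proportional to the non-zeros, not the width.
        System.out.println(r.get(3) + " " + r.get(4) + " " + r.get(101));
    }
}
```

A row with two non-zeros costs two entries no matter how wide the logical
matrix is, which is why the import never "blows up".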
Thanks,
Tom
On Mar 14, 2014, at 9:39 AM, Pat Ferrel <[email protected]> wrote:
> Love the architectural discussion but sometimes the real answers can be
> hidden by minutiae.
>
> Dimitriy, is there enough running on Spark to compare to a DRM implementation
> on H2O? 0xdata, go ahead and implement DRM on H2O. If “the proof is in the
> pudding”, why not compare?
>
> We really ARE betting Mahout on H2O, Ted. I don’t buy your denial. If Mahout
> moves to another faster, better execution engine, it will do so only once in
> the immediate future. The only real alternative to your proposal is a call to
> action for committers to move Mahout to Spark or another more well-known
> engine. These will realistically never coexist.
>
>
> Some other concerns:
>
> If H2O is only 2x as fast as Mahout on Spark, I’d be dubious about adopting an
> unknown or unproven platform. The fact that it is custom-made for BD
> Analytics is both good and bad. It means that expertise we develop for H2O
> may not be useful for other parallel computing problems. Also, it seems from
> the docs that the design point for 0xdata is not the same as Mahout’s. 0xdata
> is trying to build a faster BD analytics platform (OLAP), not sparse data
> machine learning in daily production. None of the things I use in Mahout are
> in 0xdata, I suspect because of this mismatch. It doesn’t mean it won’t work,
> but in lieu of the apples-to-apples comparison mentioned above, it does worry
> me.
>