[ Please allow me to introduce myself to everyone: I'm Tom Kraljevic with 
0xdata. ]



Hi Pat,



You raise some good questions here; please let me add to the discussion.



> If Mahout moves to another faster, better execution engine it will do so only 
> once in the immediate future.

This conversation has definitely highlighted the potential improvements that 
the next generation of compute platforms can bring to the Mahout community 
(users as well as developers).

Clearly, moving to an in-memory approach (vs Hadoop's filesystem-central 
mapreduce) enables new paths.  One interesting possible challenge in my mind is 
how to take that on in a way that enables many options for the Mahout community 
instead of locking in to one.  People have expressed interest in at least three 
different next-gen platforms here (H2O, Spark, Stratosphere), and the question 
I would have is whether it is possible to take advantage of all of them.  At a 
high level, I would think that Mahout users (and machine learners in general) 
would be well served by having more options.

So I think a really interesting API question/challenge for the Mahout community 
is:  does it make sense to think about an API that considers in-memory 
primitives as an abstraction enabling different engines to be plugged in?
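To make the idea concrete, here is a rough sketch of what such a pluggable 
abstraction might look like.  This is purely illustrative Python; none of these 
names (DistributedMatrix, Engine, LocalEngine) exist in Mahout, H2O, Spark, or 
Stratosphere -- they are hypothetical placeholders for the kind of interface I 
have in mind:

```python
from abc import ABC, abstractmethod

# Hypothetical engine-agnostic in-memory matrix API (illustrative only).
class DistributedMatrix(ABC):
    @abstractmethod
    def times(self, other):     # matrix multiply
        ...

    @abstractmethod
    def transpose(self):
        ...

    @abstractmethod
    def collect(self):          # materialize to local memory
        ...

class Engine(ABC):
    @abstractmethod
    def parallelize(self, rows):
        """Load local row data into the engine's distributed representation."""
        ...

# A trivial single-node "engine" showing how one backend plugs in; a Spark,
# H2O, or Stratosphere backend would implement the same two interfaces.
class LocalMatrix(DistributedMatrix):
    def __init__(self, rows):
        self.rows = rows

    def times(self, other):
        b = other.rows
        ncols = len(b[0])
        out = [[sum(a[k] * b[k][j] for k in range(len(b)))
                for j in range(ncols)] for a in self.rows]
        return LocalMatrix(out)

    def transpose(self):
        return LocalMatrix([list(col) for col in zip(*self.rows)])

    def collect(self):
        return self.rows

class LocalEngine(Engine):
    def parallelize(self, rows):
        return LocalMatrix([list(r) for r in rows])

# An algorithm written against the abstraction runs unchanged on any engine:
def gram(engine, rows):
    a = engine.parallelize(rows)
    return a.transpose().times(a).collect()   # A'A
```

The point is that an algorithm like `gram` above is written once against the 
abstract operations, and the choice of backend becomes a deployment decision 
rather than a rewrite.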



> 0xdata is trying to build a faster BD analytics platform (OLAP), not sparse 
> data machine learning in daily production

H2O is definitely intended for machine learning in daily production.  This is 
true both at the platform level (in-memory scaling across a cluster) and in the 
algorithms we have focused on (GLM, GBM, Random Forest, etc.).  Additionally, 
we have invested in autogenerating lightweight, high-speed scoring models for 
embedding in real-time environments (to distinguish between an in-production 
daily modeling flow and an in-production traffic-facing prediction engine).


Regarding the question about sparse data, let me cut-and-paste part of Cliff's 
earlier email.

      o A note on sparse data: H2O sports about 15 different
        compression schemes under the hood, including ones designed to
        compress sparse data.  We happily import SVMLight without ever
        having the data "blow up" and while still fully supporting the
        array-access API, including speed guarantees.

This is actually our second crack at it, and we've put on our roadmap a third 
crack to tackle massively sparse datasets.
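For anyone less familiar with the format Cliff mentions: SVMLight stores each 
row as a label followed by sparse index:value pairs, so zeros never appear on 
disk.  Here is a minimal illustration of the general idea of keeping such a row 
sparse in memory while still supporting array-style access.  To be clear, this 
is NOT H2O's compression code (our schemes are column-compressed and far more 
involved); it is just a sketch of the concept:

```python
# Illustrative sketch only -- not H2O's actual compression implementation.
class SparseRow:
    """Sparse row that stores only nonzeros but answers row[j] for any j."""

    def __init__(self, length, entries):
        self.length = length
        self.entries = dict(entries)      # column index -> value

    def __getitem__(self, j):
        if not 0 <= j < self.length:
            raise IndexError(j)
        return self.entries.get(j, 0.0)   # missing columns read as zero

    def __len__(self):
        return self.length

def parse_svmlight_row(line, num_cols):
    """Parse 'label idx:val idx:val ...' (SVMLight indices are 1-based)."""
    parts = line.split()
    label = float(parts[0])
    entries = {}
    for tok in parts[1:]:
        idx, val = tok.split(":")
        entries[int(idx) - 1] = float(val)   # store 0-based internally
    return label, SparseRow(num_cols, entries)
```

A row with two nonzeros out of a million columns costs two dict entries instead 
of a million floats, and reads of the zero columns still return 0.0 without the 
data ever "blowing up" into a dense array.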



Thanks,
Tom



On Mar 14, 2014, at 9:39 AM, Pat Ferrel <[email protected]> wrote:

> Love the architectural discussion but sometimes the real answers can be 
> hidden by minutiae.
> 
> Dimitriy, is there enough running on Spark to compare to a DRM implementation 
> on H2O? 0xdata, go ahead and implement DRM on H2O. If “the proof is in the 
> pudding” why not compare? 
> 
> We really ARE betting Mahout on H2O Ted. I don’t buy your denial. If Mahout 
> moves to another faster, better execution engine it will do so only once in 
> the immediate future. The only real alternative to your proposal is a call to 
> action for committers to move Mahout to Spark or another more well-known 
> engine. These will realistically never coexist.
> 
> 
> Some other concerns:
> 
> If H2O is only 2x as fast as Mahout on Spark I’d be dubious of adopting an 
> unknown or unproven platform. The fact that it is custom made for BD 
> Analytics is both good and bad. It means that expertise we develop for H2O 
> may not be useful for other parallel computing problems. Also it seems from 
> the docs that the design point for 0xdata is not the same as Mahout. 0xdata 
> is trying to build a faster BD analytics platform (OLAP), not sparse data 
> machine learning in daily production. None of the things I use in Mahout are 
> in 0xdata, I suspect because of this mismatch. It doesn’t mean it won’t work, 
> but in the absence of the apples-to-apples comparison mentioned above it does 
> worry me.
> 
