Accuracy and speed are important factors in any experiment because of a
simple fact of life: time is a non-renewable resource. Researchers, data
modelers, and data scientists all have limited time budgets for
experimentation.

R is slow. R has lots of packages. This is not to be taken as a causal
inference. R has a community of active developers and users, but they
suffer its slowness every day. R users are some of our biggest proponents:
we come across as a simple package that embeds in their workflows and
saves them time, by orders of magnitude. One shouldn't wait 5-10 hours for
a data ingest, a simple transformation, or a regression if it can finish
in 5 minutes or 5 seconds.

If scaling machine learning algorithms is the vision, then I'd go with a
few high-energy contributions/contributors/committers (& team) with the
same shared goals and culture. It's not about choosing a generic computing
platform with the most buzz; that could leave the community in the same
spot as going with a popular platform did a few years ago. Better yet,
create an API abstraction (as Tom mentioned) to pick the best computing
engine for the problem at hand. Assume that the big data platform will be
heterogeneous.
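To make the abstraction idea concrete, here is a minimal sketch of what
"write the algorithm once, pick the engine per problem" could look like.
Every name in it (DrmLike, DistributedEngine, EngineRegistry, LocalEngine)
is a hypothetical illustration, not an actual Mahout or H2O API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical handle to a distributed row matrix (DRM-like).
interface DrmLike {
    long nrow();
    int ncol();
}

// One pluggable back end; Spark, Stratosphere, or H2O bindings would
// each implement this.
interface DistributedEngine {
    String name();
    DrmLike drmFromPath(String path);
    DrmLike ata(DrmLike drm);   // A'A, a typical solver primitive
}

// Toy in-memory engine standing in for a real binding.
class InMemoryDrm implements DrmLike {
    private final long nrow;
    private final int ncol;
    InMemoryDrm(long nrow, int ncol) { this.nrow = nrow; this.ncol = ncol; }
    public long nrow() { return nrow; }
    public int ncol() { return ncol; }
}

class LocalEngine implements DistributedEngine {
    public String name() { return "local"; }
    public DrmLike drmFromPath(String path) { return new InMemoryDrm(0L, 0); }
    public DrmLike ata(DrmLike drm) {
        // A'A of an m x n matrix is n x n.
        return new InMemoryDrm(drm.ncol(), drm.ncol());
    }
}

// Algorithms are written once against the interfaces; a registry picks
// the engine best suited to the problem at hand.
class EngineRegistry {
    private static final Map<String, DistributedEngine> ENGINES = new HashMap<>();
    static void register(DistributedEngine e) { ENGINES.put(e.name(), e); }
    static DistributedEngine select(String name) { return ENGINES.get(name); }
}

public class EngineAbstractionSketch {
    public static void main(String[] args) {
        EngineRegistry.register(new LocalEngine());
        DistributedEngine engine = EngineRegistry.select("local");
        DrmLike a = new InMemoryDrm(100L, 10);
        DrmLike gram = engine.ata(a);
        System.out.println(gram.nrow() + "x" + gram.ncol());  // prints "10x10"
    }
}
```

The point of the sketch is only that the algorithm code never names a
concrete platform, so a heterogeneous set of back ends can coexist behind
one API.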

H2O's vision is direct and simple: scaling machine learning to power
intelligent applications. Our focus is distributed machine learning and a
fully featured set of industrial-grade algorithms. That's all we do. This
needs systems engineers and thinkers just as much as math and ML. And we
are gaining rapid adoption: our meetups are typically sold out, and H2O
has a fanatical early-adopter community of users. From a community
standpoint, our thinking is integrative and coexistent: converging
projects and communities that share a get-it-done mindset, and serving
the broader silent community of users with needs.
It all starts with the end (ML) user experience and how we can make it
better.

thanks, Sri

[1] Slow food is a thing; "slow analytics" is not a thing.
http://thomaswdinsmore.com/2014/02/12/machine-learning-in-hadoop-part-two/
[2] DeepLearning in H2O
http://www.meetup.com/Data-Mining/events/170754582/
http://www.meetup.com/Silicon-Valley-Big-Data-Science/events/170840642/
[3] http://0xdata.com/events



On Sun, Mar 16, 2014 at 9:17 AM, Pat Ferrel <[email protected]> wrote:

> So your Mahout DRM work was targeted for production at your company and
> was working well, but other parts of the project fell through and it
> didn't get deployed. Some of it is almost a year old and pretty mature.
>
> --This is very good news.
>
> You are also saying that the integration model you used for Spark would
> probably mostly work for other solver frameworks like Stratosphere but it
> doesn't look appropriate for h2o.
>
> --Good to know
>
> Your last point is that speed is not so much a deciding factor as other
> less tangible things. Your example is R which has 5000 packages and
> counting but is notoriously slow. By that I assume you are saying a speed
> comparison is not nearly as important as other factors, most of which have
> to do with attracting the largest community of users and contributors.
>
> --Here we agree for sure. Getting a faster regression or random forest
> implementation (as long as it takes Mahout formats as input) is great. But
> if it implies that committers move to the platform (h2o) used in these
> implementations, then someone must make a case for why it's in the roadmap.
>
> On Mar 14, 2014, at 3:55 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> Pat, sorry for the off-topic -- this code is actually about a year old at
> heart. I was using it to run some custom methods back at my company, but I
> had to largely reshape it to fit Mahout once I got permission to
> contribute. So this took a while, but the idea is certainly not new. At
> least parts of this code (e.g. DRM serialization) used to run something
> real at some point. Actually, the initial materialization of this code
> predates the MLI talks that I was referring to (at least when I first
> heard of MLI). Unfortunately, our experiments with big data solvers are
> currently nowhere close to production due to product priorities -- so that
> was in part why I said, well, let's at least make it public if we don't
> use it.
>
> But you can potentially develop this idea further to optimize and support
> basic data frame operators as well, all while staying independent of the
> back end. Unfortunately, the back end has to pass a certain
> programming-model maturity test; right now that would be Spark,
> Stratosphere, and other FlumeJava-like models, but I don't think 0xdata in
> particular, as it stands, passes it.
>
> Another thing (also used at our office): you can simply write it as a
> driver script and run it in a Scala shell, akin to R.
>
> The next step would be to fire up developers to write algorithms. I think
> R is now closing in on about 5,000 packages. I probably won't miss the
> truth here by much by saying this is exactly because it is an ML
> environment (and certainly not because of its performance -- R is
> notoriously slow).
>
>
>
>
> On Fri, Mar 14, 2014 at 3:39 PM, Pat Ferrel <[email protected]> wrote:
>
> > Cool, I'm super excited to see RSJ on Spark integrated into the mainline
> > with Dmitriy's work. I really, really hope that it is seen as important
> > and doesn't get stalled by committers being demotivated. I had no idea
> > that what I consider the heart of Mahout was so close to being real on
> > Spark.
> >
> > I'm also happy to hear that you are full speed ahead for this Spark work.
> > I obviously got the wrong impression.
> >
> > As to "new contributors who have some interesting capabilities" --
> > great, as long as it doesn't end up defocusing people. Old committers
> > are naturally going to wonder where to put their efforts with this
> > proposal. Some may just give up until the dust settles. I'm sure we can
> > agree that that would not be good.
> >
> > The question of roadmap is, more than ever, up for discussion. I would
> > just plead one last time that Spark work not be stalled while this is
> > worked out.
> >
> > On Mar 14, 2014, at 1:00 PM, Ted Dunning <[email protected]> wrote:
> >
> >
> > Pat
> >
> > I am not suggesting that we walk away from anything.
> >
> > I am suggesting that we welcome new contributors who have some
> interesting
> > capabilities.
> >
> > I also suggest that those efforts should be made to work well with
> > existing efforts.
> >
> > Sent from my iPhone
> >
> >> On Mar 14, 2014, at 10:58, Pat Ferrel <[email protected]> wrote:
> >>
> >> I think people (including me) have underestimated how much you and
> >> Sebastian have done on Spark. Realistically, it sounds like we are
> >> talking about walking away from that in favor of an unknown.
> >>
> >> 0xdata's community has not been solving the problems I care about. You
> >> guys have.
> >
> >
>
>


-- 
CEO & co-founder, 0xdata Inc <http://www.0xdata.com/>
+1-408.316.8192
