This sounds very exciting and enriching!  Looking forward to the developments
on this front.


On Thu, Mar 13, 2014 at 11:16 AM, SriSatish Ambati <[email protected]> wrote:

> We are excited at the possibilities of this convergence.
>
> I am a fan of Mahout's vision and how it has captured the imagination of
> machine learning enthusiasts over the years.
> (Still fondly recollect Isabel's spirited talk at ApacheCon years ago!)
>
> We found that what was needed was a real product, hacker and open source
> developer culture.
> The R community has also been looking for a package that provides distributed
> (in-memory) frames and parallel packages for the algorithms behind them.
> Our team has executed on a lot of these inspirations fast and furiously in
> open source over the past two years.
>
> We hope to enrich & fulfill the day-to-day workflows of Machine
> Learning users world-wide through this. 0xdata built H2O, which is an open
> source community inspired by the vision of Google-scale Machine Learning
> for the Enterprise using existing APIs like R. H2O has been Apache v2
> licensed for quite some time now: https://github.com/0xdata/h2o
> http://jira.0xdata.com/
> Underlying the fine-grained in-memory parallelism is a
> distributed k/v store (non-blocking hashmap) purpose-built for doing
> quantitative queries at scale while keeping consistency [1].
>
> With the help of legendary scientific advice from Stanford's Rob Tibshirani,
> Trevor Hastie and Stephen Boyd, a tight feedback-based partnership with
> customers, and phenomenal distributed systems & math teams [0], 0xdata
> built some of the fastest and most accurate [4] distributed in-memory
> algorithms: Generalized Linear Modeling, Gradient Boosting Machine, Random Forest.
> http://docs.0xdata.com/resources/algoroadmap.html
>
> For example, our billion-row regression takes under 6 seconds on a 48-node
> cluster! These and other math legos are accessible via R, Java, Python,
> JavaScript and Scala. Our R package is headed to CRAN this week & users can
> drive H2O from an R REPL.  http://docs.0xdata.com/Ruser/Rpackage.html
> H2O is also easy to install on or off Hadoop, on or off cloud [3], or on a
> laptop. http://docs.0xdata.com/deployment/hadoop.html
>
> We have been brewing an authentic, product-focused movement of systems,
> math, ML and data scientists, and spreading the word in meetups:
> ( http://0xdata.com/h2o-2/community/ )
>
> The vision is that a convergence with the Mahout community would bring
> a fresh infusion of the product & hacker culture needed to potentially make
> Mahout NextGen the principal platform for machine learning at scale that
> can run in-memory, on Hadoop, or on Spark. This will be part of the ongoing
> evolution of the Hadoop ecosystem.
> And I'd love to get the grass roots excited and energized, and have devs,
> users and founders (Grant & Isabel) corral much-needed support and momentum.
>
> Tactically speaking -
> Like all software projects, we have to start small and work in expanding
> spirals, converging at each level given the resources at hand. So a call to
> action and JIRAs are upcoming.
> Our earliest integration is going to be along the lines of publishing minimal
> artifacts (think mvn) & making the Mahout Matrix API connect over Distributed
> Frames.
>  - SparseMatrix is a real requirement on the road ahead & simple extensions
> to our core Distributed Frame [1] will get us some of the way.
>
> Together we can make this happen.
> Thanks, Ted, Ellen and others for introducing us to the community!
>
> Looking forward, Sri
>
> Reference:
> [0] Team H2O http://0xdata.com/about/
> [1] http://www.infoq.com/presentations/api-memory-analytics
> [2] http://www.slideshare.net/0xdata/
> [3] http://docs.0xdata.com/newuser/ec2.html
> [4] http://test.0xdata.com
>
>
> On Wed, Mar 12, 2014 at 6:16 PM, Andrew Musselman <
> [email protected]> wrote:
>
> > Sounds like a large positive step; looking forward to hearing more!
> >
> > > On Mar 12, 2014, at 5:44 PM, Ted Dunning <[email protected]> wrote:
> > >
> > > I have been working with a company named 0xdata to help them contribute
> > > some new software to Mahout.  This software will give Mahout the ability to
> > > do highly iterative in-memory mathematical computations on a cluster or a
> > > single machine. This software also comes with high performance distributed
> > > implementations of k-means, logistic regression, random forest and other
> > > algorithms.
> > >
> > > I will be starting a thread about this on the dev list shortly, but I
> > > wanted the PMC members to have a short heads up on what has been happening
> > > now that we have consensus on the 0xdata side of the game.
> > >
> > > I think that this has a major potential to bring in an enormous amount of
> > > contributing community to Mahout.  Technically, it will, at a stroke, make
> > > Mahout the highest performing machine learning framework around.
> > >
> > > *Development Roadmap*
> > >
> > > Of the requirements that people have been talking about on the main mailing
> > > list, the following capabilities will be provided by this contribution:
> > >
> > > 1) high performance distributed linear algebra
> > >
> > > 2) basic machine learning codes including logistic regression, other
> > >    generalized linear modeling codes, random forest, clustering
> > >
> > > 3) standard file format parsing system (CSV, Lucene, parquet, other) x
> > >    (continuous, constant, categorical, word-like, text-like)
> > >
> > > 4) standard web-based basic applications for common operations
> > >
> > > 5) language bindings (Java, Scala, R, other)
> > >
> > > 6) interactive + batch use
> > >
> > > 7) common representation/good abstraction over representation
> > >
> > > 8) platform diversity, localhost, with/without ( Hadoop, Yarn, Mesos, EC2,
> > >    GCE )
> > >
> > >
> > > *Backstory*
> > >
> > > I was recently approached by Sri Satish, CEO and co-founder of 0xdata,
> > > who wanted to explore whether they could donate some portion of the h2o
> > > framework and technology to Mahout.  I was skeptical since all that I had
> > > previously seen was the application-level demos for this system and I was
> > > not at all familiar with the technology underneath. One of the co-founders
> > > of 0xdata, however, is Cliff Click, who was one of the co-authors of the
> > > server HotSpot compiler.  That alone made the offer worth examining.
> > >
> > > Over the last few weeks, the technical team of 0xdata has been working with
> > > me to work out whether this contribution would be useful to Mahout.
> > >
> > > My strong conclusion is that the donation, with some associated shim work
> > > that 0xdata is committing to doing, will satisfy roughly 80% of the goals
> > > that have emerged over the last week or so of discussion.  Just as
> > > important, this donation connects Mahout to new communities who are very
> > > actively working at the frontiers of machine learning, which is likely to
> > > inject lots of new blood and excitement into the Mahout community.  This
> > > has huge potential outside of Mahout itself as well, since having a very
> > > strong technical infrastructure that we can all use across many projects
> > > has the potential to have the same sort of impact on machine learning
> > > applications and products that Hadoop has had for file-based parallel
> > > processing.  Coming together on a common platform has the potential to
> > > create markets that would not exist without this commonality.
> > >
> > >
> > > *Technical Underpinnings*
> > >
> > > At the lowest level, the h2o framework provides a way to have named objects
> > > stored in memory across a cluster in directly computable form.  H2o also
> > > provides a very fine-grained parallel execution framework that allows
> > > computation to be moved close to the data while maintaining computational
> > > efficiency with tasks as small as milliseconds in scale.  Objects live on
> > > multiple machines and live until they are explicitly deallocated or until
> > > the framework is terminated.
> > >
> > > Additional machines can join the framework, but data isn't automatically
> > > balanced, nor is it assumed that failures are handled within the framework.
> > > As might be expected given the background of the authors, some pretty
> > > astounding things are done using JVM magic so coding at this lowest level
> > > is remarkably congenial.
> > >
> > > This framework can be deployed as a map-only Hadoop program, or as a bunch
> > > of independent programs which borg together as they come up.  Importantly,
> > > it is trivial to start a single node framework as well for easy development
> > > and testing.
> > >
> > > On top of this lowest level, there are math libraries which implement low
> > > level operations as well as a variety of machine learning algorithms.  These
> > > include high quality implementations of a variety of machine learning
> > > programs including generalized linear modeling with binomial logistic
> > > regression and good regularization, linear regression, neural networks,
> > > random forests and so on.
> > > There are also parsing codes which will load formatted data in parallel from
> > > persistency layers such as HDFS or conventional files.
> > >
> > > At the level of these learning programs, there are web interfaces which
> > > allow data elements in the framework to be created, managed and deleted.
> > >
> > > There is also an R binding for h2o which allows programs to access and
> > > manage h2o objects.  Functions defined in an R-like language can be applied
> > > in parallel to data frames stored in the h2o framework.
> > >
> > > *Proposed Developer User Experience*
> > >
> > > I see several kinds of users.  These include numerical developers (largely
> > > mathematicians), Java or Scala developers (like current Mahout devs), and
> > > data analysts.
> > >
> > > - Local h2o single-node cluster
> > > - Temporary h2o cluster
> > > - Shared h2o cluster
> > >
> > > All of these modes will be facilitated by the proposed development.
> > >
> > > *Complementarity with Other Platforms*
> > >
> > > I view h2o as complementary with Hadoop and Spark because it provides a
> > > solid in-memory execution engine, as opposed to the general out-of-core
> > > computation model that map-reduce engines like Hadoop and Spark
> > > implement, or the more general dataflow systems like Stratosphere, Tez or
> > > Drill.
> > >
> > > Also, h2o provides no persistence but depends on other systems for that
> > > such as NFS, HDFS, NAS or MapR.
> > >
> > > H2o is also nicely complementary to R in that R can invoke operations and
> > > move data to and from h2o very easily.
> > >
> > > *Required Additional Work*
> > >
> > > - Sparse matrices
> > > - Linear algebra bindings
> > > - Class-file magic to allow off-the-cuff function definitions
> >
>
>
>
> --
> ceo & co-founder, 0 <http://www.0xdata.com/>*x*data Inc
> +1-408.316.8192
>
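
For anyone skimming the quoted thread: the idea in Ted's "Technical Underpinnings" section (named objects held in memory in chunks, with small tasks sent to the data and partial results merged) can be sketched roughly as below. This is only a single-process illustration under my own assumptions. The names ChunkedFrame, map_reduce, chunk_stats and merge_stats are hypothetical and are not H2O's API, and a thread pool stands in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ChunkedFrame:
    """Toy stand-in for a named, chunked in-memory column (hypothetical, not H2O's API)."""

    def __init__(self, name, values, n_chunks):
        self.name = name
        # Ceiling division so every value lands in exactly one chunk.
        size = max(1, -(-len(values) // n_chunks))
        self.chunks = [values[i:i + size] for i in range(0, len(values), size)]

    def map_reduce(self, map_fn, reduce_fn):
        # Run the small task once per chunk ("move computation to the data"),
        # then merge the per-chunk partial results pairwise.
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(map_fn, self.chunks))
        return reduce(reduce_fn, partials)


def chunk_stats(chunk):
    # Per-chunk partial result: (row count, sum of values).
    return (len(chunk), sum(chunk))


def merge_stats(a, b):
    # Combine two partial results into one.
    return (a[0] + b[0], a[1] + b[1])


frame = ChunkedFrame("prices", list(range(1, 101)), n_chunks=4)
count, total = frame.map_reduce(chunk_stats, merge_stats)
mean = total / count
```

The point of the sketch is only that the per-chunk task returns a small summary, so what crosses chunk boundaries is the summary rather than the data itself.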
