This sounds very exciting and enriching! Look forward to the developments on this front.
On Thu, Mar 13, 2014 at 11:16 AM, SriSatish Ambati <[email protected]> wrote:

> We are excited at the possibilities of this convergence.
>
> I am a fan of Mahout's vision and how it captured the imagination of machine
> learning enthusiasts over the years.
> (I still fondly recollect Isabel's spirited talk at ApacheCon years ago!)
>
> We found that a real product, hacker and open source developer culture
> was the need.
> The R community has also been looking for a package that solves distributed
> frames (in-memory) and parallel packages for the algorithms behind them.
> Our team has executed on a lot of these inspirations fast and furiously in
> open source over the past two years.
>
> We hope to enrich and fulfill the day-to-day workflows of machine
> learning users worldwide through this. 0xdata built H2O, which is an open
> source community inspired by the vision of Google-scale machine learning
> for the enterprise using existing APIs like R. H2O has been Apache v2
> licensed for quite some time now:
> https://github.com/0xdata/h2o
> http://jira.0xdata.com/
> Underlying its fine-grained in-memory parallelism is a distributed k/v
> store (non-blocking hashmap), purpose-built for doing quantitative queries
> at scale while keeping consistency [1].
>
> With the help of legendary scientific advice from Stanford's Rob Tibshirani,
> Trevor Hastie and Stephen Boyd, a tight feedback-based partnership with
> customers, and phenomenal distributed systems and math teams [0], 0xdata
> built some of the fastest and most accurate [4] distributed in-memory
> algorithms: Generalized Logistic Modeling, Gradient Boosting Machine,
> Random Forest.
> http://docs.0xdata.com/resources/algoroadmap.html
>
> For example, our billion-row regression takes under 6 seconds on a 48-node
> cluster! These and other math legos are accessible via R, Java, Python,
> JavaScript and Scala. Our R package is headed to CRAN this week, and users
> can drive from an R REPL.
> http://docs.0xdata.com/Ruser/Rpackage.html
> H2O is also easy to install on or off Hadoop, on or off cloud [3], or on a
> laptop. http://docs.0xdata.com/deployment/hadoop.html
>
> We have been brewing an authentic, product-focused movement of systems,
> math, ML and data scientists and spreading the word in meetups
> ( http://0xdata.com/h2o-2/community/ ).
>
> The vision is that a convergence with the Mahout community would bring the
> fresh infusion of product and hacker culture needed to potentially make
> Mahout NextGen the principal platform for machine learning at scale that
> can run in-memory, on Hadoop, or on Spark. This will be part of the ongoing
> evolution of the Hadoop ecosystem.
> And I'd love to get the grass roots excited and energized, and to have devs,
> users and founders (Grant & Isabel) corral much-needed support and momentum.
>
> Tactically speaking:
> Like all software projects, we have to start small and work in expanding
> spirals, converging at each level given the resources at hand. So a call to
> action and JIRAs are upcoming.
> Our earliest integration is going to be along the lines of getting minimal
> artifacts (think mvn) and making the Mahout Matrix API connect over
> Distributed Frames.
> - SparseMatrix is a real requirement on the road ahead, and simple extensions
>   to our core Distributed Frame [1] will get us some of the way.
>
> Together we can make this happen.
> Thanks, Ted, Ellen and others for introducing us to the community!
>
> Looking forward, Sri
>
> Reference:
> [0] Team H2O http://0xdata.com/about/
> [1] http://www.infoq.com/presentations/api-memory-analytics
> [2] http://www.slideshare.net/0xdata/
> [3] http://docs.0xdata.com/newuser/ec2.html
> [4] http://test.0xdata.com
>
> On Wed, Mar 12, 2014 at 6:16 PM, Andrew Musselman <[email protected]> wrote:
>
> > Sounds like a large positive step; looking forward to hearing more!
> >
> > On Mar 12, 2014, at 5:44 PM, Ted Dunning <[email protected]> wrote:
> > >
> > > I have been working with a company named 0xdata to help them contribute
> > > some new software to Mahout. This software will give Mahout the ability
> > > to do highly iterative in-memory mathematical computations on a cluster
> > > or a single machine. This software also comes with high performance
> > > distributed implementations of k-means, logistic regression, random
> > > forest and other algorithms.
> > >
> > > I will be starting a thread about this on the dev list shortly, but I
> > > wanted the PMC members to have a short heads up on what has been
> > > happening now that we have consensus on the 0xdata side of the game.
> > >
> > > I think that this has a major potential to bring in an enormous amount
> > > of contributing community to Mahout. Technically, it will, at a stroke,
> > > make Mahout the highest performing machine learning framework around.
> > >
> > > *Development Roadmap*
> > >
> > > Of the requirements that people have been talking about on the main
> > > mailing list, the following capabilities will be provided by this
> > > contribution:
> > >
> > > 1) high performance distributed linear algebra
> > >
> > > 2) basic machine learning codes including logistic regression, other
> > > generalized linear modeling codes, random forest, clustering
> > >
> > > 3) standard file format parsing system (CSV, Lucene, parquet, other) x
> > > (continuous, constant, categorical, word-like, text-like)
> > >
> > > 4) standard web-based basic applications for common operations
> > >
> > > 5) language bindings (Java, Scala, R, other)
> > >
> > > 6) interactive + batch use
> > >
> > > 7) common representation/good abstraction over representation
> > >
> > > 8) platform diversity, localhost, with/without ( Hadoop, Yarn, Mesos,
> > > EC2, GCE )
> > >
> > >
> > > *Backstory*
> > >
> > > I was recently approached by Sri Satish, CEO and
> > > co-founder of 0xdata, who wanted to explore whether they could donate
> > > some portion of the h2o framework and technology to Mahout. I was
> > > skeptical since all that I had previously seen was the application
> > > level demos for this system and I was not at all familiar with the
> > > technology underneath. One of the co-founders of 0xdata, however, is
> > > Cliff Click, who was one of the co-authors of the server HotSpot
> > > compiler. That alone made the offer worth examining.
> > >
> > > Over the last few weeks, the technical team of 0xdata has been working
> > > with me to work out whether this contribution would be useful to Mahout.
> > >
> > > My strong conclusion is that the donation, with some associated shim
> > > work that 0xdata is committing to doing, will satisfy roughly 80% of the
> > > goals that have emerged over the last week or so of discussion. Just as
> > > important, this donation connects Mahout to new communities who are very
> > > actively working at the frontiers of machine learning, which is likely
> > > to inject lots of new blood and excitement into the Mahout community.
> > > This has huge potential outside of Mahout itself as well, since having a
> > > very strong technical infrastructure that we can all use across many
> > > projects has the potential to have the same sort of impact on machine
> > > learning applications and products that Hadoop has had for file-based
> > > parallel processing. Coming together on a common platform has the
> > > potential to create markets that would otherwise not exist if we don't
> > > have this commonality.
> > >
> > >
> > > *Technical Underpinnings*
> > >
> > > At the lowest level, the h2o framework provides a way to have named
> > > objects stored in memory across a cluster in directly computable form.
> > > H2o also provides a very fine-grained parallel execution framework that
> > > allows computation to be moved close to the data while maintaining
> > > computational efficiency with tasks as small as milliseconds in scale.
> > > Objects live on multiple machines and live until they are explicitly
> > > deallocated or until the framework is terminated.
> > >
> > > Additional machines can join the framework, but data isn't automatically
> > > balanced, nor is it assumed that failures are handled within the
> > > framework. As might be expected given the background of the authors,
> > > some pretty astounding things are done using JVM magic so coding at this
> > > lowest level is remarkably congenial.
> > >
> > > This framework can be deployed as a map-only Hadoop program, or as a
> > > bunch of independent programs which borg together as they come up.
> > > Importantly, it is trivial to start a single node framework as well for
> > > easy development and testing.
> > >
> > > On top of this lowest level, there are math libraries which implement
> > > low level operations as well as a variety of machine learning
> > > algorithms. These include high quality implementations of a variety of
> > > machine learning programs including generalized linear modeling with
> > > binomial logistic regression and good regularization, linear regression,
> > > neural networks, random forests and so on.
> > > There are also parsing codes which will load formatted data in parallel
> > > from persistency layers such as HDFS or conventional files.
> > >
> > > At the level of these learning programs, there are web interfaces which
> > > allow data elements in the framework to be created, managed and deleted.
> > >
> > > There is also an R binding for h2o which allows programs to access and
> > > manage h2o objects.
> > > Functions defined in an R-like language can be applied in parallel to
> > > data frames stored in the h2o framework.
> > >
> > > *Proposed Developer User Experience*
> > >
> > > I see several kinds of users. These include numerical developers
> > > (largely mathematicians), Java or Scala developers (like current Mahout
> > > devs), and data analysts.
> > >
> > > - Local h2o single-node cluster
> > > - Temporary h2o cluster
> > > - Shared h2o cluster
> > >
> > > All of these modes will be facilitated by the proposed development.
> > >
> > > *Complementarity with Other Platforms*
> > >
> > > I view h2o as complementary with Hadoop and Spark because it provides a
> > > solid in-memory execution engine as opposed to a general out-of-core
> > > computation model that other map-reduce engines like Hadoop and Spark
> > > implement or more general dataflow systems like Stratosphere, Tez or
> > > Drill.
> > >
> > > Also, h2o provides no persistence but depends on other systems for that
> > > such as NFS, HDFS, NAS or MapR.
> > >
> > > H2o is also nicely complementary to R in that R can invoke operations
> > > and move data to and from h2o very easily.
> > >
> > > *Required Additional Work*
> > >
> > > - Sparse matrices
> > > - Linear algebra bindings
> > > - Class-file magic to allow off-the-cuff function definitions
> > >
>
> --
> ceo & co-founder, 0xdata Inc <http://www.0xdata.com/>
> +1-408.316.8192
