+1. Happy to help get it migrated!
On Mar 12, 2014, at 8:44 PM, Ted Dunning <[email protected]> wrote:

> I have been working with a company named 0xdata to help them contribute
> some new software to Mahout. This software will give Mahout the ability to
> do highly iterative in-memory mathematical computations on a cluster or a
> single machine. This software also comes with high performance distributed
> implementations of k-means, logistic regression, random forest and other
> algorithms.
>
> I will be starting a thread about this on the dev list shortly, but I
> wanted the PMC members to have a short heads up on what has been happening
> now that we have consensus on the 0xdata side of the game.
>
> I think that this has major potential to bring an enormous amount of
> contributing community to Mahout. Technically, it will, at a stroke, make
> Mahout the highest performing machine learning framework around.
>
> *Development Roadmap*
>
> Of the requirements that people have been talking about on the main mailing
> list, the following capabilities will be provided by this contribution:
>
> 1) high performance distributed linear algebra
>
> 2) basic machine learning codes including logistic regression, other
> generalized linear modeling codes, random forest, clustering
>
> 3) standard file format parsing system (CSV, Lucene, parquet, other) x
> (continuous, constant, categorical, word-like, text-like)
>
> 4) standard web-based basic applications for common operations
>
> 5) language bindings (Java, Scala, R, other)
>
> 6) interactive + batch use
>
> 7) common representation/good abstraction over representation
>
> 8) platform diversity, localhost, with/without (Hadoop, Yarn, Mesos, EC2,
> GCE)
>
>
> *Backstory*
>
> I was recently approached by Sri Satish, CEO and co-founder of 0xdata, who
> wanted to explore whether they could donate some portion of the h2o
> framework and technology to Mahout.
> I was skeptical since all that I had
> previously seen was the application level demos for this system and was not
> at all familiar with the technology underneath. One of the co-founders of
> 0xdata, however, is Cliff Click, who was one of the co-authors of the
> HotSpot server compiler. That alone made the offer worth examining.
>
> Over the last few weeks, the technical team of 0xdata has been working with
> me to work out whether this contribution would be useful to Mahout.
>
> My strong conclusion is that the donation, with some associated shim work
> that 0xdata is committing to doing, will satisfy roughly 80% of the goals
> that have emerged over the last week or so of discussion. Just as
> important, this donation connects Mahout to new communities who are very
> actively working at the frontiers of machine learning, which is likely to
> inject lots of new blood and excitement into the Mahout community. This
> has huge potential outside of Mahout itself as well, since having a very
> strong technical infrastructure that we can all use across many projects
> has the potential to have the same sort of impact on machine learning
> applications and products that Hadoop has had for file-based parallel
> processing. Coming together on a common platform has the potential to
> create markets that would not exist without this commonality.
>
>
> *Technical Underpinnings*
>
> At the lowest level, the h2o framework provides a way to have named objects
> stored in memory across a cluster in directly computable form. H2o also
> provides a very fine-grained parallel execution framework that allows
> computation to be moved close to the data while maintaining computational
> efficiency with tasks as small as milliseconds in scale. Objects live on
> multiple machines and live until they are explicitly deallocated or until
> the framework is terminated.
>
> Additional machines can join the framework, but data isn't automatically
> balanced, nor is it assumed that failures are handled within the framework.
> As might be expected given the background of the authors, some pretty
> astounding things are done using JVM magic so coding at this lowest level
> is remarkably congenial.
>
> This framework can be deployed as a map-only Hadoop program, or as a bunch
> of independent programs which borg together as they come up. Importantly,
> it is trivial to start a single node framework as well for easy development
> and testing.
>
> On top of this lowest level, there are math libraries which implement low
> level operations as well as a variety of machine learning algorithms. These
> include high quality implementations of a variety of machine learning
> programs including generalized linear modeling with binomial logistic
> regression and good regularization, linear regression, neural networks,
> random forests and so on. There are also parsing codes which will load
> formatted data in parallel from persistency layers such as HDFS or
> conventional files.
>
> At the level of these learning programs, there are web interfaces which
> allow data elements in the framework to be created, managed and deleted.
>
> There is also an R binding for h2o which allows programs to access and
> manage h2o objects. Functions defined in an R-like language can be applied
> in parallel to data frames stored in the h2o framework.
>
> *Proposed Developer User Experience*
>
> I see several kinds of users. These include numerical developers (largely
> mathematicians), Java or Scala developers (like current Mahout devs), and
> data analysts.
>
> - Local h2o single-node cluster
> - Temporary h2o cluster
> - Shared h2o cluster
>
> All of these modes will be facilitated by the proposed development.
>
> *Complementarity with Other Platforms*
>
> I view h2o as complementary with Hadoop and Spark because it provides a
> solid in-memory execution engine as opposed to a general out-of-core
> computation model that other map-reduce engines like Hadoop and Spark
> implement or more general dataflow systems like Stratosphere, Tez or Drill.
>
> Also, h2o provides no persistence but depends on other systems for that,
> such as NFS, HDFS, NAS or MapR.
>
> H2o is also nicely complementary to R in that R can invoke operations and
> move data to and from h2o very easily.
>
> *Required Additional Work*
>
> Sparse matrices
> Linear algebra bindings
> Class-file magic to allow off-the-cuff function definitions

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com
