There have been several comments, here and elsewhere, asking whether providing h2o-based components would conflict with Spark or with other systems that Mahout might be built on.
As I see it, Spark, Stratosphere, Tez and Drill are trying to fix Hadoop's map-reduce layer in various ways. This isn't going to fix what is wrong with numerical computing in Mahout because the problems are different.

To my mind, the key problems for numerical computing include:

a) efficient, very fine-grained parallelism (think microseconds)

b) efficient in-memory mutable storage

c) no serialization of data between steps

These problems are not even addressed by most data-flow architectures because they are trying to solve data-flow problems and none of these three problems are data-flow issues. In fact, (b) and (c) are simply contradictory. In contrast, these are the three core issues that h2o attempts to address.

This is good and wonderful, but it doesn't imply that we can do without data-flow primitives as well. Data-flow is a wonderfully apt idiom for describing variable extraction work and it makes huge sense to have both good clustered matrix math as well as really good data-flow.

On Wed, Mar 12, 2014 at 11:25 PM, Manoj Awasthi <[email protected]> wrote:

> This sounds very exciting and enriching! Look forward to the developments
> on this front.
>
> On Thu, Mar 13, 2014 at 11:16 AM, SriSatish Ambati <[email protected]> wrote:
>
> > We are excited at the possibilities of this convergence.
> >
> > A fan of Mahout's vision and how it captured the imagination of machine
> > learning enthusiasts over the years.
> > (Still fondly recollect Isabel's spirited talk at ApacheCon years ago!)
> >
> > We found that a real product, hacker and an open source developer culture
> > was the need.
> > The R community has also been looking for a package that solved
> > distributed frames (in-memory) & parallel packages for the algorithms
> > behind.
> > Our team has executed on a lot of these inspirations fast & furiously in
> > open source over the past two years.
> >
> > We hope to enrich & fulfill the day-to-day workflows of the Machine
> > Learning users world-wide through this. 0xdata built H2O, which is an
> > open source community inspired by the vision of Google-scale Machine
> > Learning for the Enterprise using existing APIs like R. H2O has been
> > Apache v2, https://github.com/0xdata/h2o for quite some time now.
> > http://jira.0xdata.com/ Underlying a fine-grain in-memory parallelism
> > is a distributed k/v store (non-blocking hashmap) purpose built for
> > doing quantitative queries at scale while keeping consistency [1]
> >
> > With help of legendary Scientific Advice from Stanford's Rob Tibshirani,
> > Trevor Hastie, Stephen Boyd, a tight-feedback based partnership with
> > customers and phenomenal distributed systems & math teams [0], 0xdata
> > built some of the fastest and accurate [4] distributed algorithms
> > in-Memory - Generalized Logistic Modeling, Gradient Boosting Machine,
> > Random Forest
> > http://docs.0xdata.com/resources/algoroadmap.html
> >
> > For example, our billion row regression takes under 6 seconds on a
> > 48-node cluster! These and other math legos are accessible via R, Java,
> > Python, JavaScript and Scala. Our R package is headed to CRAN this week
> > & users can drive from an R REPL.
> > http://docs.0xdata.com/Ruser/Rpackage.html
> > H2O is also easy to install on or off Hadoop, on or off cloud[3] or on a
> > laptop. http://docs.0xdata.com/deployment/hadoop.html
> >
> > We have been brewing an authentic product focused movement of systems,
> > math, ml and data scientists and spreading the word in meetups -
> > ( http://0xdata.com/h2o-2/community/ )
> >
> > The vision is that a convergence with the Mahout community would bring
> > fresh infusion of product & hacker culture needed to potentially make
> > Mahout NextGen the principal platform for machine learning at scale that
> > can run in-memory, on Hadoop, or on Spark.
> > This will be part of the ongoing evolution of the Hadoop ecosystem.
> > And I'd love to get the grass roots excited, energized and have dev,
> > users and founders (Grant & Isabel) corral much needed support and
> > momentum.
> >
> > Tactically speaking -
> > Like all software projects, we have to start small & in expansive spirals
> > to converge at each level given the resources at hand. So a call to
> > action and JIRAs are upcoming.
> > Our earliest integration is going to be along the lines of getting
> > minimal artifacts (think mvn) & make the Mahout Matrix API connect over
> > Distributed Frames.
> > - SparseMatrix is a real requirement on the road ahead & simple
> > extensions to our core Distributed Frame [1] will get us some of the way.
> >
> > Together we can make this happen.
> > Thanks, Ted, Ellen and others for introducing us to the community!
> >
> > Looking forward, Sri
> >
> > Reference:
> > [0] Team H2O http://0xdata.com/about/
> > [1] http://www.infoq.com/presentations/api-memory-analytics
> > [2] http://www.slideshare.net/0xdata/
> > [3] http://docs.0xdata.com/newuser/ec2.html
> > [4] http://test.0xdata.com
> >
> > On Wed, Mar 12, 2014 at 6:16 PM, Andrew Musselman <[email protected]> wrote:
> >
> > > Sounds like a large positive step; looking forward to hearing more!
> > >
> > > > On Mar 12, 2014, at 5:44 PM, Ted Dunning <[email protected]> wrote:
> > > >
> > > > I have been working with a company named 0xdata to help them
> > > > contribute some new software to Mahout. This software will give
> > > > Mahout the ability to do highly iterative in-memory mathematical
> > > > computations on a cluster or a single machine. This software also
> > > > comes with high performance distributed implementations of k-means,
> > > > logistic regression, random forest and other algorithms.
> > > >
> > > > I will be starting a thread about this on the dev list shortly, but I
> > > > wanted the PMC members to have a short heads up on what has been
> > > > happening now that we have consensus on the 0xdata side of the game.
> > > >
> > > > I think that this has a major potential to bring in an enormous
> > > > amount of contributing community to Mahout. Technically, it will, at
> > > > a stroke, make Mahout the highest performing machine learning
> > > > framework around.
> > > >
> > > > *Development Roadmap*
> > > >
> > > > Of the requirements that people have been talking about on the main
> > > > mailing list, the following capabilities will be provided by this
> > > > contribution:
> > > >
> > > > 1) high performance distributed linear algebra
> > > >
> > > > 2) basic machine learning codes including logistic regression, other
> > > > generalized linear modeling codes, random forest, clustering
> > > >
> > > > 3) standard file format parsing system (CSV, Lucene, parquet, other) x
> > > > (continuous, constant, categorical, word-like, text-like)
> > > >
> > > > 4) standard web-based basic applications for common operations
> > > >
> > > > 5) language bindings (Java, Scala, R, other)
> > > >
> > > > 6) interactive + batch use
> > > >
> > > > 7) common representation/good abstraction over representation
> > > >
> > > > 8) platform diversity, localhost, with/without (Hadoop, Yarn, Mesos,
> > > > EC2, GCE)
> > > >
> > > > *Backstory*
> > > >
> > > > I was recently approached by Sri Satish, CEO and co-founder of 0xdata,
> > > > who wanted to explore whether they could donate some portion of the
> > > > h2o framework and technology to Mahout. I was skeptical since all that
> > > > I had previously seen was the application level demos for this system
> > > > and was not at all familiar with the technology underneath.
> > > > One of the co-founders of 0xdata, however, is Cliff Click who was one
> > > > of the co-authors of the server HotSpot compiler. That alone made the
> > > > offer worth examining.
> > > >
> > > > Over the last few weeks, the technical team of 0xdata has been working
> > > > with me to work out whether this contribution would be useful to
> > > > Mahout.
> > > >
> > > > My strong conclusion is that the donation, with some associated shim
> > > > work that 0xdata is committing to doing, will satisfy roughly 80% of
> > > > the goals that have emerged over the last week or so of discussion.
> > > > Just as important, this donation connects Mahout to new communities
> > > > who are very actively working at the frontiers of machine learning,
> > > > which is likely to inject lots of new blood and excitement into the
> > > > Mahout community. This has huge potential outside of Mahout itself as
> > > > well since having a very strong technical infrastructure that we can
> > > > all use across many projects has the potential to have the same sort
> > > > of impact on machine learning applications and products that Hadoop
> > > > has had for file-based parallel processing. Coming together on a
> > > > common platform has the potential to create markets that would
> > > > otherwise not exist if we don't have this commonality.
> > > >
> > > > *Technical Underpinnings*
> > > >
> > > > At the lowest level, the h2o framework provides a way to have named
> > > > objects stored in memory across a cluster in directly computable
> > > > form. H2o also provides a very fine-grained parallel execution
> > > > framework that allows computation to be moved close to the data while
> > > > maintaining computational efficiency with tasks as small as
> > > > milliseconds in scale. Objects live on multiple machines and live
> > > > until they are explicitly deallocated or until the framework is
> > > > terminated.
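
As a rough illustration of the paragraph just quoted (named objects held in memory in directly computable form, with small tasks run against them in place), here is a minimal single-JVM sketch in Java. It is not h2o code and uses no h2o API: the class NamedVectorSketch, the STORE map and the sum helper are names invented for this example, and a plain ConcurrentHashMap plus the JDK fork/join pool merely stand in for a distributed store and a cluster scheduler.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class NamedVectorSketch {
    // Hypothetical stand-in for a cluster-wide store: name -> raw double[] kept in memory.
    static final ConcurrentHashMap<String, double[]> STORE = new ConcurrentHashMap<>();

    // Fine-grained task: split an index range until it is small, then sum it in place.
    static final class SumTask extends RecursiveTask<Double> {
        static final int CHUNK = 1024; // small leaf size keeps individual tasks short
        final double[] data;
        final int lo, hi;

        SumTask(double[] data, int lo, int hi) {
            this.data = data;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Double compute() {
            if (hi - lo <= CHUNK) {
                double s = 0.0;
                for (int i = lo; i < hi; i++) s += data[i];
                return s;
            }
            int mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(data, lo, mid);
            SumTask right = new SumTask(data, mid, hi);
            left.fork();                           // run the left half asynchronously
            return right.compute() + left.join();  // compute the right half here, then combine
        }
    }

    // Look the object up by name and compute on the stored array directly (no copy, no serialization).
    static double sum(String name) {
        double[] v = STORE.get(name);
        if (v == null) throw new IllegalArgumentException("no object named " + name);
        return ForkJoinPool.commonPool().invoke(new SumTask(v, 0, v.length));
    }

    public static void main(String[] args) {
        double[] x = new double[10_000];
        Arrays.fill(x, 1.0);
        STORE.put("x", x);             // the object lives until it is explicitly removed
        x[0] = 5.0;                    // mutable in place
        System.out.println(sum("x"));  // prints 10004.0
        STORE.remove("x");             // explicit deallocation
    }
}
```

The point of the sketch is only the shape of the idea: data stays in its computable form under a name, work is cut into many small tasks, and nothing is serialized between steps. A real cluster-wide version would also have to deal with placement of data across machines, which this example ignores.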
> > > >
> > > > Additional machines can join the framework, but data isn't
> > > > automatically balanced, nor is it assumed that failures are handled
> > > > within the framework. As might be expected given the background of
> > > > the authors, some pretty astounding things are done using JVM magic
> > > > so coding at this lowest level is remarkably congenial.
> > > >
> > > > This framework can be deployed as a map-only Hadoop program, or as a
> > > > bunch of independent programs which borg together as they come up.
> > > > Importantly, it is trivial to start a single node framework as well
> > > > for easy development and testing.
> > > >
> > > > On top of this lowest level, there are math libraries which implement
> > > > low level operations as well as a variety of machine learning
> > > > algorithms. These include high quality implementations of a variety
> > > > of machine learning programs including generalized linear modeling
> > > > with binomial logistic regression and good regularization, linear
> > > > regression, neural networks, random forests and so on. There are also
> > > > parsing codes which will load formatted data in parallel from
> > > > persistency layers such as HDFS or conventional files.
> > > >
> > > > At the level of these learning programs, there are web interfaces
> > > > which allow data elements in the framework to be created, managed and
> > > > deleted.
> > > >
> > > > There is also an R binding for h2o which allows programs to access
> > > > and manage h2o objects. Functions defined in an R-like language can
> > > > be applied in parallel to data frames stored in the h2o framework.
> > > >
> > > > *Proposed Developer User Experience*
> > > >
> > > > I see several kinds of users.
> > > > These include numerical developers (largely mathematicians), Java or
> > > > Scala developers (like current Mahout devs), and data analysts.
> > > >
> > > > - Local h2o single-node cluster
> > > > - Temporary h2o cluster
> > > > - Shared h2o cluster
> > > >
> > > > All of these modes will be facilitated by the proposed development.
> > > >
> > > > *Complementarity with Other Platforms*
> > > >
> > > > I view h2o as complementary with Hadoop and Spark because it provides
> > > > a solid in-memory execution engine as opposed to a general
> > > > out-of-core computation model that other map-reduce engines like
> > > > Hadoop and Spark implement or more general dataflow systems like
> > > > Stratosphere, Tez or Drill.
> > > >
> > > > Also, h2o provides no persistence but depends on other systems for
> > > > that such as NFS, HDFS, NAS or MapR.
> > > >
> > > > H2o is also nicely complementary to R in that R can invoke operations
> > > > and move data to and from h2o very easily.
> > > >
> > > > *Required Additional Work*
> > > >
> > > > Sparse matrices
> > > > Linear algebra bindings
> > > > Class-file magic to allow off-the-cuff function definitions
> > > >
> >
> > --
> > ceo & co-founder, 0xdata Inc <http://www.0xdata.com/>
> > +1-408.316.8192
