http://www.youtube.com/watch?v=CDP6NayO1yM -- seek to 13:20
On Fri, Mar 14, 2014 at 1:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
> I am sorry for saying this -- I just feel I am entitled to an opinion.
> This is exactly the style of API in Hadoop and Mahout that allows Evan
> Sparks to make his very convincing talk points.
>
> On Fri, Mar 14, 2014 at 12:54 PM, SriSatish Ambati <[email protected]> wrote:
>
>> H2O's unknown only to become known. All of us have watched every
>> successful open source phenomenon go through that phase: Linux,
>> Apache, Hadoop, and until recently even Spark were all targets of
>> fear and uncertainty. I'm a fan of Spark and Matei's relentless
>> pursuit over the years. Quantitative primitives were not the focus
>> for them.
>>
>> Dmitriy's point on the programming model is a good one:
>> - Our programming model is map/reduce on a distributed, chunked k/v
>> store. As plain-jane as it gets.
>> - We don't feel competitive with Spark. An algorithm designer should
>> be able to define algorithms that run on multiple architectures.
>> H2O can easily embrace Spark at the Scala/MLI layer or at the RDD
>> data ingest/store layer. Some of our users use Shark for
>> pre-processing and H2O for the machine learning.
>>
>> The reality is that there is no architectural silver bullet for any
>> good-sized body of real-world use cases. Interoperability and
>> heterogeneity in the data center, and among developers, are a given.
>> We should be open to embracing that.
>>
>> - The point about better documentation of the architecture is well
>> taken, and something that is being addressed. The algorithms
>> themselves are well documented and work as advertised in production
>> environments. (The product carries its documentation with it.)
>>
>> Let me segue to presenting a simple linear regression program on H2O
>> (one we use in some of our meetups & community efforts):
>> https://github.com/0xdata/h2o/blob/master/src/main/java/hex/LR2.java
>>
>> *Commentary on the code -*
>>
>> *1. Break the problem down into discrete phases.*
>>
>>   // Pass 1: compute sums & sums-of-squares
>>   CalcSumsTask lr1 = new CalcSumsTask().doAll(vec_x, vec_y);
>>
>>   // Pass 2: compute squared errors
>>   final double meanX = lr1._sumX/nrows;
>>   final double meanY = lr1._sumY/nrows;
>>   CalcSquareErrorsTasks lr2 = new CalcSquareErrorsTasks(meanX, meanY).doAll(vec_x, vec_y);
>>
>>   // Pass 3: compute the regression
>>   beta1 = lr2._XYbar / lr2._XXbar;
>>   beta0 = meanY - beta1 * meanX;
>>   CalcRegressionTask lr3 = new CalcRegressionTask(beta0, beta1, meanY).doAll(vec_x, vec_y);
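>>
>> For readers following the code, the three passes just accumulate the
>> textbook least-squares identities; in the code's own names:
>>
>>   _XXbar = sum (x - meanX)^2
>>   _YYbar = sum (y - meanY)^2            (this is SSTO)
>>   _XYbar = sum (x - meanX)*(y - meanY)
>>
>>   beta1 = _XYbar / _XXbar
>>   beta0 = meanY - beta1 * meanX
>>   R^2   = SSR / SSTO
>>   svar  = SSE / (n - 2)
>>   stderr(beta1) = sqrt( svar / _XXbar )
>>   stderr(beta0) = sqrt( svar/n + meanX^2 * svar/_XXbar )
>>
>> where SSE = sum (fit - y)^2 and SSR = sum (fit - meanY)^2, exactly as
>> CalcRegressionTask computes them below.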
>>
>> *2. Use the Map/Reduce programming model for the tasks.*
>> * - Think of chunks as units of batching over the data.*
>>
>>   public static class CalcSumsTask extends MRTask2<CalcSumsTask> {
>>     long _n;                    // Rows used
>>     double _sumX,_sumY,_sumX2;  // Sum of X's, Y's, X^2's
>>     @Override public void map( Chunk xs, Chunk ys ) {
>>       for( int i=0; i<xs._len; i++ ) {
>>         double X = xs.at0(i);
>>         double Y = ys.at0(i);
>>         if( !Double.isNaN(X) && !Double.isNaN(Y) ) {
>>           _sumX += X;
>>           _sumY += Y;
>>           _sumX2+= X*X;
>>           _n++;
>>         }
>>       }
>>     }
>>     @Override public void reduce( CalcSumsTask lr1 ) {
>>       _sumX += lr1._sumX ;
>>       _sumY += lr1._sumY ;
>>       _sumX2+= lr1._sumX2;
>>       _n    += lr1._n;
>>     }
>>   }
>>
>> *3. High-level goals:*
>> * - Make the code read close to a math DSL -- easier to recruit math
>> folks to debug or spot errors.*
>> * - Autogenerate JIT-friendly, optimized code where need be.*
>> * - Minimize passes over the data.*
>>
>> *4. Other best practices:*
>> * - Separate input and output data formats from the algorithm.*
>> * - Use primitives for better memory management.*
>> * - Generate JSON and HTML APIs for easy testing & usability.*
>>
>> [Reference]
>> https://github.com/0xdata/h2o/blob/master/src/main/java/hex/LR2.java
>>
>> package hex;
>>
>> import water.*;
>> import water.api.DocGen;
>> import water.fvec.*;
>> import water.util.RString;
>>
>> public class LR2 extends Request2 {
>>   static final int API_WEAVER = 1;            // This file has auto-gen'd doc & json fields
>>   static public DocGen.FieldDoc[] DOC_FIELDS; // Initialized from auto-gen code.
>>
>>   // This Request supports the HTML 'GET' command, and this is the help text for GET.
>>   static final String DOC_GET = "Linear Regression between 2 columns";
>>
>>   @API(help="Data Frame", required=true, filter=Default.class)
>>   Frame source;
>>
>>   @API(help="Column X", required=true, filter=LR2VecSelect.class)
>>   Vec vec_x;
>>
>>   @API(help="Column Y", required=true, filter=LR2VecSelect.class)
>>   Vec vec_y;
>>   class LR2VecSelect extends VecSelect { LR2VecSelect() { super("source"); } }
>>
>>   @API(help="Pass 1 msec") long pass1time;
>>   @API(help="Pass 2 msec") long pass2time;
>>   @API(help="Pass 3 msec") long pass3time;
>>   @API(help="nrows") long nrows;
>>   @API(help="beta0") double beta0;
>>   @API(help="beta1") double beta1;
>>   @API(help="r-squared") double r2;
>>   @API(help="SSTO") double ssto;
>>   @API(help="SSE") double sse;
>>   @API(help="SSR") double ssr;
>>   @API(help="beta0 Std Error") double beta0stderr;
>>   @API(help="beta1 Std Error") double beta1stderr;
>>
>>   @Override public Response serve() {
>>     // Pass 1: compute sums & sums-of-squares
>>     long start = System.currentTimeMillis();
>>     CalcSumsTask lr1 = new CalcSumsTask().doAll(vec_x, vec_y);
>>     long pass1 = System.currentTimeMillis();
>>     pass1time = pass1 - start;
>>     nrows = lr1._n;
>>
>>     // Pass 2: compute squared errors
>>     final double meanX = lr1._sumX/nrows;
>>     final double meanY = lr1._sumY/nrows;
>>     CalcSquareErrorsTasks lr2 = new CalcSquareErrorsTasks(meanX, meanY).doAll(vec_x, vec_y);
>>     long pass2 = System.currentTimeMillis();
>>     pass2time = pass2 - pass1;
>>     ssto = lr2._YYbar;
>>
>>     // Pass 3: compute the regression
>>     beta1 = lr2._XYbar / lr2._XXbar;
>>     beta0 = meanY - beta1 * meanX;
>>     CalcRegressionTask lr3 = new CalcRegressionTask(beta0, beta1, meanY).doAll(vec_x, vec_y);
>>     long pass3 = System.currentTimeMillis();
>>     pass3time = pass3 - pass2;
>>
>>     long df = nrows - 2;
>>     r2 = lr3._ssr / lr2._YYbar;
>>     double svar  = lr3._rss / df;
>>     double svar1 = svar / lr2._XXbar;
>>     double svar0 = svar/nrows + meanX*meanX*svar1;
>>     beta0stderr = Math.sqrt(svar0);
>>     beta1stderr = Math.sqrt(svar1);
>>     sse = lr3._rss;
>>     ssr = lr3._ssr;
>>
>>     return Response.done(this);
>>   }
>>
>>   public static class CalcSumsTask extends MRTask2<CalcSumsTask> {
>>     long _n;                    // Rows used
>>     double _sumX,_sumY,_sumX2;  // Sum of X's, Y's, X^2's
>>     @Override public void map( Chunk xs, Chunk ys ) {
>>       for( int i=0; i<xs._len; i++ ) {
>>         double X = xs.at0(i);
>>         double Y = ys.at0(i);
>>         if( !Double.isNaN(X) && !Double.isNaN(Y) ) {
>>           _sumX += X;
>>           _sumY += Y;
>>           _sumX2+= X*X;
>>           _n++;
>>         }
>>       }
>>     }
>>     @Override public void reduce( CalcSumsTask lr1 ) {
>>       _sumX += lr1._sumX ;
>>       _sumY += lr1._sumY ;
>>       _sumX2+= lr1._sumX2;
>>       _n    += lr1._n;
>>     }
>>   }
>>
>>   public static class CalcSquareErrorsTasks extends MRTask2<CalcSquareErrorsTasks> {
>>     final double _meanX, _meanY;
>>     double _XXbar, _YYbar, _XYbar;
>>     CalcSquareErrorsTasks( double meanX, double meanY ) { _meanX = meanX; _meanY = meanY; }
>>     @Override public void map( Chunk xs, Chunk ys ) {
>>       for( int i=0; i<xs._len; i++ ) {
>>         double Xa = xs.at0(i);
>>         double Ya = ys.at0(i);
>>         if( !Double.isNaN(Xa) && !Double.isNaN(Ya) ) {
>>           Xa -= _meanX;
>>           Ya -= _meanY;
>>           _XXbar += Xa*Xa;
>>           _YYbar += Ya*Ya;
>>           _XYbar += Xa*Ya;
>>         }
>>       }
>>     }
>>     @Override public void reduce( CalcSquareErrorsTasks lr2 ) {
>>       _XXbar += lr2._XXbar;
>>       _YYbar += lr2._YYbar;
>>       _XYbar += lr2._XYbar;
>>     }
>>   }
>>
>>   public static class CalcRegressionTask extends MRTask2<CalcRegressionTask> {
>>     final double _meanY;
>>     final double _beta0, _beta1;
>>     double _rss, _ssr;
>>     CalcRegressionTask(double beta0, double beta1, double meanY) { _beta0=beta0; _beta1=beta1; _meanY=meanY; }
>>     @Override public void map( Chunk xs, Chunk ys ) {
>>       for( int i=0; i<xs._len; i++ ) {
>>         double X = xs.at0(i); double Y = ys.at0(i);
>>         if( !Double.isNaN(X) && !Double.isNaN(Y) ) {
>>           double fit = _beta1*X + _beta0;
>>           double rs = fit-Y;
>>           _rss += rs*rs;
>>           double sr = fit-_meanY;
>>           _ssr += sr*sr;
>>         }
>>       }
>>     }
>>     @Override public void reduce( CalcRegressionTask lr3 ) {
>>       _rss += lr3._rss;
>>       _ssr += lr3._ssr;
>>     }
>>   }
>>
>>   /** Return the query link to this page */
>>   public static String link(Key k, String content) {
>>     RString rs = new RString("<a href='LR2.query?data_key=%$key'>%content</a>");
>>     rs.replace("key", k.toString());
>>     rs.replace("content", content);
>>     return rs.toString();
>>   }
>> }
>>
>> thanks, Sri
>>
>> On Fri, Mar 14, 2014 at 9:39 AM, Pat Ferrel <[email protected]> wrote:
>>
>> > Love the architectural discussion, but sometimes the real answers can
>> > be hidden by minutiae.
>> >
>> > Dmitriy, is there enough running on Spark to compare to a DRM
>> > implementation on H2O? 0xdata, go ahead and implement DRM on H2O. If
>> > "the proof is in the pudding", why not compare?
>> >
>> > We really ARE betting Mahout on H2O, Ted. I don't buy your denial. If
>> > Mahout moves to another, faster, better execution engine, it will do
>> > so only once in the immediate future. The only real alternative to
>> > your proposal is a call to action for committers to move Mahout to
>> > Spark or another better-known engine. These will realistically never
>> > coexist.
>> >
>> > Some other concerns:
>> >
>> > If H2O is only 2x as fast as Mahout on Spark, I'd be dubious of
>> > adopting an unknown or unproven platform. The fact that it is
>> > custom-made for BD analytics is both good and bad. It means that
>> > expertise we develop for H2O may not be useful for other parallel
>> > computing problems. Also, it seems from the docs that the design
>> > point for 0xdata is not the same as Mahout's. 0xdata is trying to
>> > build a faster BD analytics platform (OLAP), not sparse-data machine
>> > learning in daily production. None of the things I use in Mahout are
>> > in 0xdata, I suspect because of this mismatch. It doesn't mean it
>> > won't work, but in the absence of the apples-to-apples comparison
>> > mentioned above, it does worry me.
>> >
>> > On Mar 14, 2014, at 7:21 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> >
>> > > I think that the proposal under discussion involves adding a
>> > > dependency on a maven-released h2o artifact plus a contribution of
>> > > Mahout translation layers. These layers would give a sub-class of
>> > > Matrix (and Vector) which allow direct control over life span
>> > > across multiple jobs but would otherwise behave like their
>> > > in-memory counterparts.
>> >
>> > Well, I suppose that means they have to live in some processes which
>> > are not processes I already have. And they have to be managed. So
>> > this is not just an in-core subsystem. Sounds like a new back to me.
>> >
>> > >> In Hadoop, every iteration must be scheduled as a separate job,
>> > >> rereads invariant data and materializes its result to hdfs.
>> > >> Therefore, iterative programs on Hadoop are an order of magnitude
>> > >> slower than on systems that have dedicated support for iterations.
>> > >>
>> > >> Does h2o help here, or would we need to incorporate another system
>> > >> for such tasks?
>> > >
>> > > H2o helps here in a couple of different ways.
>> > >
>> > > The first and foremost is that primitive operations are easy.
>> > > Additionally, data elements can survive a single program's
>> > > execution. This means that programs can be executed one after
>> > > another to get composite effects. This is astonishingly fast ...
>> > > more along the speeds one would expect from a single-processor
>> > > program.
>> >
>> > I think the problem here is that the authors keep comparing these
>> > techniques to the slowest model available, which is Hadoop.
>> >
>> > But this is the exact execution model of Spark. You get stuff
>> > repeatedly executed on in-memory partitions, and get approximately
>> > the speed of iterative in-memory execution. I won't describe it as
>> > astonishing, though, because indeed it is as fast as you can get
>> > things done in memory -- no faster, no slower. That's, for example,
>> > the reason why my linalg optimizer does not hesitate to compute exact
>> > matrix geometry lazily, if not known, for optimization purposes: the
>> > answer will be back within 40 to 200 ms, assuming adequate RAM
>> > allocation. I have been using these paradigms for more than a year
>> > now. This is all good stuff. I would not use the word astonishing,
>> > but sensible, yes. My main concern is whether the programming model
>> > is being sacrificed just to do sensible things here.
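>> >
>> > To make the comparison concrete: here is a minimal sketch of that
>> > load-once/iterate-many pattern, fitting the same y = beta0 + beta1*x
>> > line by gradient descent purely as an illustration (LR2 above gets
>> > the exact answer in 3 passes). It assumes a recent Spark Java API
>> > with Java 8 lambdas; the input path, learning rate and iteration
>> > count are made up:
>> >
>> >   import org.apache.spark.api.java.JavaRDD;
>> >   import org.apache.spark.api.java.JavaSparkContext;
>> >
>> >   public class GDSketch {
>> >     public static void main(String[] args) {
>> >       JavaSparkContext sc = new JavaSparkContext("local[*]", "gd-sketch");
>> >       // "x,y" rows: parsed once, then pinned in memory for all passes
>> >       JavaRDD<double[]> xy = sc.textFile("xy.csv")
>> >           .map(s -> { String[] t = s.split(",");
>> >                       return new double[]{ Double.parseDouble(t[0]),
>> >                                            Double.parseDouble(t[1]) }; })
>> >           .cache();
>> >       final long n = xy.count();
>> >       double b0 = 0, b1 = 0;
>> >       for (int i = 0; i < 50; i++) {   // each pass hits in-memory partitions only
>> >         final double a0 = b0, a1 = b1;
>> >         // gradient of 0.5 * mean squared residual w.r.t. (b0, b1)
>> >         double[] g = xy.map(p -> { double r = a0 + a1 * p[0] - p[1];
>> >                                    return new double[]{ r, r * p[0] }; })
>> >                        .reduce((u, v) -> new double[]{ u[0] + v[0], u[1] + v[1] });
>> >         b0 -= 0.01 * g[0] / n;
>> >         b1 -= 0.01 * g[1] / n;
>> >       }
>> >       System.out.println("beta0=" + b0 + " beta1=" + b1);
>> >       sc.stop();
>> >     }
>> >   }
>> >
>> > The point is not the algorithm; it is that after the single
>> > textFile()/cache(), all 50 passes run against memory, which is the
>> > "composite effects" behavior described above.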
>> >
>> > >> (2) Efficient join implementations
>> > >>
>> > >> If we look at a lot of Mahout's algorithm implementations with a
>> > >> database hat on, then we see lots of hand-coded joins in our
>> > >> codebase, because Hadoop does not bring join primitives. This has
>> > >> lots of drawbacks, e.g. it complicates the codebase and leads to
>> > >> hardcoded join strategies that bake certain assumptions into the
>> > >> code (e.g. ALS uses a broadcast-join, which assumes that one side
>> > >> fits into memory on each machine; RecommenderJob uses a
>> > >> repartition-join, which is scalable but very slow for small
>> > >> inputs, ...).
>> >
>> > +1
>> >
>> > > I think that h2o provides this, but I do not know in detail how. I
>> > > do know that many of the algorithms already coded make use of
>> > > matrix multiplication, which is essentially a join operation.
>> >
>> > Essentially a join? The Spark module optimizer picks from at least 3
>> > implementations: zip+combine, block-wise cartesian and, finally, yes,
>> > join+combine. Which one depends on the orientation and the earlier
>> > operators in the pipeline. That's exactly my point about the
>> > flexibility of the programming model from the optimizer's point of
>> > view.
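>> >
>> > For the record, the two hardcoded strategies called out above
>> > (ALS-style broadcast vs. RecommenderJob-style repartition) look
>> > roughly like this in Spark's Java API -- a hedged sketch only, with
>> > illustrative key/value types and names:
>> >
>> >   import java.util.Map;
>> >   import org.apache.spark.api.java.JavaPairRDD;
>> >   import org.apache.spark.api.java.JavaSparkContext;
>> >   import org.apache.spark.broadcast.Broadcast;
>> >   import scala.Tuple2;
>> >
>> >   public class JoinSketch {
>> >     static void demo(JavaSparkContext sc,
>> >                      JavaPairRDD<Long, Double> big,
>> >                      JavaPairRDD<Long, Double> small) {
>> >       // Repartition (shuffle) join: both sides are shuffled by key.
>> >       // Scales to two large inputs, but pays a full shuffle even
>> >       // when one side is tiny.
>> >       JavaPairRDD<Long, Tuple2<Double, Double>> shuffled = big.join(small);
>> >
>> >       // Broadcast join: replicate the small side to every worker and
>> >       // join map-side, with no shuffle. Bakes in the assumption that
>> >       // the small side fits in memory on each machine.
>> >       Broadcast<Map<Long, Double>> bc = sc.broadcast(small.collectAsMap());
>> >       JavaPairRDD<Long, Tuple2<Double, Double>> mapSide =
>> >           big.filter(kv -> bc.value().containsKey(kv._1()))
>> >              .mapToPair(kv -> new Tuple2<>(kv._1(),
>> >                  new Tuple2<>(kv._2(), bc.value().get(kv._1()))));
>> >     }
>> >   }
>> >
>> > An optimizer's job is exactly to pick between these (and the
>> > zip/cartesian variants) instead of baking the choice into each
>> > algorithm.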
>> >
>> > >> Obviously, I'd love to get rid of hand-coded joins and implement
>> > >> ML algorithms (which is hard enough on its own). Other systems
>> > >> help with this already. Spark, for example, offers broadcast and
>> > >> repartition-join primitives; Stratosphere has a join primitive and
>> > >> an optimizer that automatically decides which join strategy to
>> > >> use, as well as a highly optimized hybrid hash-join implementation
>> > >> that can gracefully go out-of-core under memory pressure.
>> >
>> > > When you get into the realm of things on this level of
>> > > sophistication, I think that you have found the boundary where
>> > > alternative foundations like Spark and Stratosphere are better than
>> > > h2o. The novelty with h2o is the hypothesis that a very large
>> > > fraction of interesting ML algorithms can be implemented without
>> > > this power. So far, this seems correct.
>> >
>> > Again, this is largely along the lines of "let's make a library of a
>> > few hand-optimized things". Which is noble but -- I would argue --
>> > not ambitious enough. Most of the distributed ML projects do just
>> > that. We should perhaps think along the lines of what could be a
>> > differentiating factor for us.
>> >
>> > Not that we should not care about performance. It should be, of
>> > course, *sensible*. (Our MR code base of course does not give us
>> > that; as you said, jumping off the MR wagon is not even a question.)
>> >
>> > If you can forgive me for drawing parallels here, it's the difference
>> > between something like Weka and R: a collection vs. a platform _and_
>> > a collection induced by the platform. The platform of course also
>> > feeds directly into the speed of collection growth.
>> >
>> > When I use R, my code does not consist of algorithm calls. Yes, there
>> > is off-the-shelf use now and then, but that is far from the only
>> > thing going on; 95% of it is simple feature massaging. I place no
>> > value in R for providing GLM for me. Gosh, this particular offering
>> > is virtually hanging from everywhere these days.
>> >
>> > But I do place value in it for doing custom feature prep and, for
>> > example, for being able to get 100 grad students to try their own
>> > k-means implementations in seconds.
>> >
>> > Why?
>> >
>> > There has been a lot of talk here about building community,
>> > contributions, etc. A platform is what builds those, most directly
>> > and most amazingly. I would go out on a limb here and say that Spark
>> > and MLlib are experiencing explosive growth of contributions not
>> > because they can do things with in-memory datasets (which is
>> > important but, like I said, has long since been viewed as no more
>> > than just sensible), but because of the clarity of the programming
>> > model. I think we have seen very solid evidence that the clarity and
>> > richness of the programming model is the thing that attracts
>> > communities.
>> >
>> > If we grade roughly (very roughly!) what we have today, I can easily
>> > argue that acceptance levels follow the programming model very
>> > closely. E.g., if I sort projects with distributed programming models
>> > by (my subjectively perceived) popularity, from bottom to top:
>> >
>> > ********
>> >
>> > Hadoop MapReduce -- ok, I don't even know how to organize the
>> > critique here; too long of a list. Almost nobody (but Mahout) does
>> > things this way today. Certainly neither of my last 2 employers did.
>> >
>> > Hive -- SQL-like, with severely constrained general programming
>> > language capabilities; not conducive to batches. Pretty much limited
>> > to ad-hoc exploration.
>> >
>> > Pig -- a bit better; one can write batches, but extra functionality
>> > mixins (UDFs) are still a royal pain.
>> >
>> > Cascading -- even easier: rich primitives, easy batches, some manual
>> > optimization of physical plan elements. One of the big cons is the
>> > limitation of a rigid dataset tuple structure.
>> >
>> > FlumeJava (Crunch in the Apache world) -- even better, but Java
>> > closures are just plain ugly: zero "scriptability". Its community was
>> > hurt a little by the fact that it was a bit late to the show compared
>> > to others (e.g. Cascading), but it leveled off quickly.
>> >
>> > Scala bindings for Cascading (Scalding) and for FlumeJava -- better;
>> > hell, well better on the closure and FP front! But still, not being
>> > native to Scala from the get-go creates some miniature problems
>> > there.
>> >
>> > Spark -- I think it is fair to say the current community "king" above
>> > all of those: all the aforementioned platform-model pains are
>> > eliminated, although on the performance side I think there are still
>> > some pockets for improvement in cost-based optimization.
>> >
>> > Stratosphere might be more interesting in this department, but I am
>> > not sure at this point whether that will necessarily translate into
>> > performance benefits for ML.
>> >
>> > ********
>> >
>> > The first few of these use the same computing model underneath and
>> > have essentially the same performance. Yet there's clear variation in
>> > community and acceptance.
>> >
>> > In the ML world, we are seeing approximately the same thing. The
>> > clearer the programming model and the easier the integration into the
>> > process, the wider the acceptance. I can probably quite successfully
>> > argue that the most performant ML "thing" as it stands today is
>> > GraphLab. And it is pretty comprehensive in problem coverage (I think
>> > it covers, e.g., recommender concerns better than h2o and Mahout
>> > together). But I can also quite successfully argue that it is
>> > rejected a lot of the time for being just a collection (which, in
>> > addition, is hard to call from the JVM -- i.e., integration again).
>> > It is actually so bad that people in my company would rather go back
>> > to 20 SNOW-wired R servers than even entertain an architecture
>> > including a GraphLab component. (Yes, the variance of this sample is
>> > as high as it gets; I am just saying what I hear.)
>> >
>> > So, as a general guideline to solve the current ills, it would stand
>> > to reason to adopt platform priority, with the algorithm collection
>> > as a function of that platform, rather than a collection as a
>> > function of a few dedicated efforts. Yes -- it has to be *sensibly*
>> > performant -- but this does not have to be mostly a concern of the
>> > code in this project directly. Rather, it has to be a concern of the
>> > backs (i.e. dependencies) and of our in-core support.
>> >
>> > Our pathological fear of being a performance scapegoat totally
>> > obscures the fact that performance is mostly a function of the back,
>> > and that we were riding the wrong back for a long time. As long as we
>> > don't cling to a particular back, it shouldn't be a problem. What
>> > would one rather accept: being initially 5x slower than GraphLab (but
>> > on par with MLlib) and beating them on community support, or being on
>> > par but anemic in community? If the 0xdata platform feels performance
>> > has been so important as to sacrifice the programming model, why do
>> > they feel the need to join an Apache project? After all, they have
>> > been an open project for a long time already and have built their own
>> > community, big or small. Spark has just now become a top-level Apache
>> > project, joined the Apache incubator a mere 2 months ago, and did not
>> > have any trouble attracting a community outside Apache at all.
>> > Stratosphere is not even in Apache. Similarly, did being in Apache
>> > help Mahout get anywhere close to these in community measurement? So
>> > this totally refutes the argument that one has to be an Apache
>> > project to get its exclusive qualities highlighted. Perhaps in the
>> > end it is more about the importance of the qualities to the
>> > community, and the quality of contributions.
>> >
>> > A lot of this platform-and-programming-model priority is probably
>> > easier said than done, but some of the linalg and data frame things
>> > are ridiculously easy in terms of the amount of effort. If I could do
>> > a linalg optimizer with bindings for Spark on 2 nights a month, the
>> > same can be done for multiple backs and data frames in a jiffy. Well,
>> > the back should of course have a clear programming model as a
>> > prerequisite. Which brings us back to the issue of the richness of
>> > distributed primitives.
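>> >
>> > To make "collection as a function of platform" concrete, a purely
>> > hypothetical Java sketch -- nothing like this exists; it only
>> > illustrates the shape of a back-neutral surface that each back
>> > (Spark, H2O, Stratosphere, ...) could implement with its own
>> > optimizer behind it:
>> >
>> >   // Hypothetical back-neutral distributed matrix handle.
>> >   public interface DistributedMatrix {
>> >     DistributedMatrix transpose();
>> >     // The back's optimizer picks the physical plan here:
>> >     // zip+combine, block-wise cartesian, join+combine, broadcast, ...
>> >     DistributedMatrix times(DistributedMatrix other);
>> >     double[][] collect();   // materialize in core memory
>> >   }
>> >
>> > An algorithm written once against such a surface, e.g.
>> > a.transpose().times(a) for A'A, would then run unchanged on any back.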
>>
>> --
>> ceo & co-founder, 0xdata Inc <http://www.0xdata.com/>
>> +1-408.316.8192
