http://www.youtube.com/watch?v=CDP6NayO1yM -- seek to 13:20
On Fri, Mar 14, 2014 at 1:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
> I am sorry for saying this -- I just feel I am entitled to an opinion.
> This is exactly the style of API in Hadoop and Mahout that allows Evan
> Sparks to make his very convincing talk points.
>
> On Fri, Mar 14, 2014 at 12:54 PM, SriSatish Ambati <[email protected]> wrote:
>
>> H2O's unknown only to become known. All of us have watched every
>> successful open source phenomenon go through that phase: Linux,
>> Apache, Hadoop, and until recently even Spark were all targets of
>> fear and uncertainty. I'm a fan of Spark and Matei's relentless
>> pursuit over the years. Quantitative primitives were not the focus
>> for them.
>>
>> Dmitriy's point on the programming model is a good one:
>> - Our programming model is map/reduce on a distributed, chunked k/v
>> store. As plain-jane as it gets.
>> - We don't feel competitive with Spark. An algorithm designer should
>> be able to define algorithms that run on multiple architectures.
>> H2O can easily embrace Spark at the Scala/MLI layer or at the RDD
>> data ingest/store layer. Some of our users use Shark for
>> pre-processing and H2O for the machine learning.
>>
>> The reality is that there is no architectural silver bullet for any
>> good-sized body of real-world use cases. Interoperability and
>> heterogeneity in the data center, and among developers, are a given.
>> We should be open to embracing that.
>>
>> - The point about better documentation of the architecture is well
>> taken, and something that is being addressed. The algorithms
>> themselves are well documented and work as advertised in production
>> environments. (The product carries its documentation with it.)
>>
>> Let me segue to presenting a simple linear regression program on H2O
>> (one we use in some of our meetups & community efforts):
>> https://github.com/0xdata/h2o/blob/master/src/main/java/hex/LR2.java
>>
>> *Commentary on the code -*
>>
>> *1. Break the problem down into discrete phases.*
>>
>>   // Pass 1: compute sums & sums-of-squares
>>   CalcSumsTask lr1 = new CalcSumsTask().doAll(vec_x, vec_y);
>>
>>   // Pass 2: compute squared errors
>>   final double meanX = lr1._sumX/nrows;
>>   final double meanY = lr1._sumY/nrows;
>>   CalcSquareErrorsTasks lr2 = new CalcSquareErrorsTasks(meanX, meanY).doAll(vec_x, vec_y);
>>
>>   // Pass 3: compute the regression
>>   beta1 = lr2._XYbar / lr2._XXbar;
>>   beta0 = meanY - beta1 * meanX;
>>   CalcRegressionTask lr3 = new CalcRegressionTask(beta0, beta1, meanY).doAll(vec_x, vec_y);
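>>
>> For readers following the code, the three passes just accumulate the
>> textbook least-squares identities; in the code's own names:
>>
>>   _XXbar = sum (x - meanX)^2
>>   _YYbar = sum (y - meanY)^2            (this is SSTO)
>>   _XYbar = sum (x - meanX)*(y - meanY)
>>
>>   beta1 = _XYbar / _XXbar
>>   beta0 = meanY - beta1 * meanX
>>   R^2   = SSR / SSTO
>>   svar  = SSE / (n - 2)
>>   stderr(beta1) = sqrt( svar / _XXbar )
>>   stderr(beta0) = sqrt( svar/n + meanX^2 * svar/_XXbar )
>>
>> where SSE = sum (fit - y)^2 and SSR = sum (fit - meanY)^2, exactly as
>> CalcRegressionTask computes them below.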
>>
>> *2. Use the Map/Reduce programming model for the tasks.*
>> * - Think of chunks as units of batching over the data.*
>>
>>   public static class CalcSumsTask extends MRTask2<CalcSumsTask> {
>>     long _n;                    // Rows used
>>     double _sumX,_sumY,_sumX2;  // Sum of X's, Y's, X^2's
>>     @Override public void map( Chunk xs, Chunk ys ) {
>>       for( int i=0; i<xs._len; i++ ) {
>>         double X = xs.at0(i);
>>         double Y = ys.at0(i);
>>         if( !Double.isNaN(X) && !Double.isNaN(Y) ) {
>>           _sumX += X;
>>           _sumY += Y;
>>           _sumX2+= X*X;
>>           _n++;
>>         }
>>       }
>>     }
>>     @Override public void reduce( CalcSumsTask lr1 ) {
>>       _sumX += lr1._sumX ;
>>       _sumY += lr1._sumY ;
>>       _sumX2+= lr1._sumX2;
>>       _n    += lr1._n;
>>     }
>>   }
>>
>> *3. High-level goals:*
>> * - Make the code read close to a math DSL -- easier to recruit math
>> folks to debug or spot errors.*
>> * - Autogenerate JIT-friendly, optimized code where need be.*
>> * - Minimize passes over the data.*
>>
>> *4. Other best practices:*
>> * - Separate input and output data formats from the algorithm.*
>> * - Use primitives for better memory management.*
>> * - Generate JSON and HTML APIs for easy testing & usability.*
>>
>> [Reference]
>> https://github.com/0xdata/h2o/blob/master/src/main/java/hex/LR2.java
>>
>> package hex;
>>
>> import water.*;
>> import water.api.DocGen;
>> import water.fvec.*;
>> import water.util.RString;
>>
>> public class LR2 extends Request2 {
>>   static final int API_WEAVER = 1;            // This file has auto-gen'd doc & json fields
>>   static public DocGen.FieldDoc[] DOC_FIELDS; // Initialized from auto-gen code.
>>
>>   // This Request supports the HTML 'GET' command, and this is the help text for GET.
>>   static final String DOC_GET = "Linear Regression between 2 columns";
>>
>>   @API(help="Data Frame", required=true, filter=Default.class)
>>   Frame source;
>>
>>   @API(help="Column X", required=true, filter=LR2VecSelect.class)
>>   Vec vec_x;
>>
>>   @API(help="Column Y", required=true, filter=LR2VecSelect.class)
>>   Vec vec_y;
>>   class LR2VecSelect extends VecSelect { LR2VecSelect() { super("source"); } }
>>
>>   @API(help="Pass 1 msec") long pass1time;
>>   @API(help="Pass 2 msec") long pass2time;
>>   @API(help="Pass 3 msec") long pass3time;
>>   @API(help="nrows") long nrows;
>>   @API(help="beta0") double beta0;
>>   @API(help="beta1") double beta1;
>>   @API(help="r-squared") double r2;
>>   @API(help="SSTO") double ssto;
>>   @API(help="SSE") double sse;
>>   @API(help="SSR") double ssr;
>>   @API(help="beta0 Std Error") double beta0stderr;
>>   @API(help="beta1 Std Error") double beta1stderr;
>>
>>   @Override public Response serve() {
>>     // Pass 1: compute sums & sums-of-squares
>>     long start = System.currentTimeMillis();
>>     CalcSumsTask lr1 = new CalcSumsTask().doAll(vec_x, vec_y);
>>     long pass1 = System.currentTimeMillis();
>>     pass1time = pass1 - start;
>>     nrows = lr1._n;
>>
>>     // Pass 2: compute squared errors
>>     final double meanX = lr1._sumX/nrows;
>>     final double meanY = lr1._sumY/nrows;
>>     CalcSquareErrorsTasks lr2 = new CalcSquareErrorsTasks(meanX, meanY).doAll(vec_x, vec_y);
>>     long pass2 = System.currentTimeMillis();
>>     pass2time = pass2 - pass1;
>>     ssto = lr2._YYbar;
>>
>>     // Pass 3: compute the regression
>>     beta1 = lr2._XYbar / lr2._XXbar;
>>     beta0 = meanY - beta1 * meanX;
>>     CalcRegressionTask lr3 = new CalcRegressionTask(beta0, beta1, meanY).doAll(vec_x, vec_y);
>>     long pass3 = System.currentTimeMillis();
>>     pass3time = pass3 - pass2;
>>
>>     long df = nrows - 2;
>>     r2 = lr3._ssr / lr2._YYbar;
>>     double svar  = lr3._rss / df;
>>     double svar1 = svar / lr2._XXbar;
>>     double svar0 = svar/nrows + meanX*meanX*svar1;
>>     beta0stderr = Math.sqrt(svar0);
>>     beta1stderr = Math.sqrt(svar1);
>>     sse = lr3._rss;
>>     ssr = lr3._ssr;
>>
>>     return Response.done(this);
>>   }
>>
>>   public static class CalcSumsTask extends MRTask2<CalcSumsTask> {
>>     long _n;                    // Rows used
>>     double _sumX,_sumY,_sumX2;  // Sum of X's, Y's, X^2's
>>     @Override public void map( Chunk xs, Chunk ys ) {
>>       for( int i=0; i<xs._len; i++ ) {
>>         double X = xs.at0(i);
>>         double Y = ys.at0(i);
>>         if( !Double.isNaN(X) && !Double.isNaN(Y) ) {
>>           _sumX += X;
>>           _sumY += Y;
>>           _sumX2+= X*X;
>>           _n++;
>>         }
>>       }
>>     }
>>     @Override public void reduce( CalcSumsTask lr1 ) {
>>       _sumX += lr1._sumX ;
>>       _sumY += lr1._sumY ;
>>       _sumX2+= lr1._sumX2;
>>       _n    += lr1._n;
>>     }
>>   }
>>
>>   public static class CalcSquareErrorsTasks extends MRTask2<CalcSquareErrorsTasks> {
>>     final double _meanX, _meanY;
>>     double _XXbar, _YYbar, _XYbar;
>>     CalcSquareErrorsTasks( double meanX, double meanY ) { _meanX = meanX; _meanY = meanY; }
>>     @Override public void map( Chunk xs, Chunk ys ) {
>>       for( int i=0; i<xs._len; i++ ) {
>>         double Xa = xs.at0(i);
>>         double Ya = ys.at0(i);
>>         if( !Double.isNaN(Xa) && !Double.isNaN(Ya) ) {
>>           Xa -= _meanX;
>>           Ya -= _meanY;
>>           _XXbar += Xa*Xa;
>>           _YYbar += Ya*Ya;
>>           _XYbar += Xa*Ya;
>>         }
>>       }
>>     }
>>     @Override public void reduce( CalcSquareErrorsTasks lr2 ) {
>>       _XXbar += lr2._XXbar;
>>       _YYbar += lr2._YYbar;
>>       _XYbar += lr2._XYbar;
>>     }
>>   }
>>
>>   public static class CalcRegressionTask extends MRTask2<CalcRegressionTask> {
>>     final double _meanY;
>>     final double _beta0, _beta1;
>>     double _rss, _ssr;
>>     CalcRegressionTask(double beta0, double beta1, double meanY) { _beta0=beta0; _beta1=beta1; _meanY=meanY; }
>>     @Override public void map( Chunk xs, Chunk ys ) {
>>       for( int i=0; i<xs._len; i++ ) {
>>         double X = xs.at0(i); double Y = ys.at0(i);
>>         if( !Double.isNaN(X) && !Double.isNaN(Y) ) {
>>           double fit = _beta1*X + _beta0;
>>           double rs = fit-Y;
>>           _rss += rs*rs;
>>           double sr = fit-_meanY;
>>           _ssr += sr*sr;
>>         }
>>       }
>>     }
>>     @Override public void reduce( CalcRegressionTask lr3 ) {
>>       _rss += lr3._rss;
>>       _ssr += lr3._ssr;
>>     }
>>   }
>>
>>   /** Return the query link to this page */
>>   public static String link(Key k, String content) {
>>     RString rs = new RString("<a href='LR2.query?data_key=%$key'>%content</a>");
>>     rs.replace("key", k.toString());
>>     rs.replace("content", content);
>>     return rs.toString();
>>   }
>> }
>>
>> thanks, Sri
>>
>> On Fri, Mar 14, 2014 at 9:39 AM, Pat Ferrel <[email protected]> wrote:
>>
>> > Love the architectural discussion, but sometimes the real answers can
>> > be hidden by minutiae.
>> >
>> > Dmitriy, is there enough running on Spark to compare to a DRM
>> > implementation on H2O? 0xdata, go ahead and implement DRM on H2O. If
>> > "the proof is in the pudding", why not compare?
>> >
>> > We really ARE betting Mahout on H2O, Ted. I don't buy your denial. If
>> > Mahout moves to another, faster, better execution engine, it will do
>> > so only once in the immediate future. The only real alternative to
>> > your proposal is a call to action for committers to move Mahout to
>> > Spark or another better-known engine. These will realistically never
>> > coexist.
>> >
>> > Some other concerns:
>> >
>> > If H2O is only 2x as fast as Mahout on Spark, I'd be dubious of
>> > adopting an unknown or unproven platform. The fact that it is
>> > custom-made for BD analytics is both good and bad. It means that
>> > expertise we develop for H2O may not be useful for other parallel
>> > computing problems. Also, it seems from the docs that the design
>> > point for 0xdata is not the same as Mahout's. 0xdata is trying to
>> > build a faster BD analytics platform (OLAP), not sparse-data machine
>> > learning in daily production. None of the things I use in Mahout are
>> > in 0xdata, I suspect because of this mismatch. It doesn't mean it
>> > won't work, but in the absence of the apples-to-apples comparison
>> > mentioned above, it does worry me.
>> >
>> > On Mar 14, 2014, at 7:21 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> >
>> > > I think that the proposal under discussion involves adding a
>> > > dependency on a maven-released h2o artifact plus a contribution of
>> > > Mahout translation layers. These layers would give a sub-class of
>> > > Matrix (and Vector) which allow direct control over life span
>> > > across multiple jobs but would otherwise behave like their
>> > > in-memory counterparts.
>> >
>> > Well, I suppose that means they have to live in some processes which
>> > are not processes I already have. And they have to be managed. So
>> > this is not just an in-core subsystem. Sounds like a new back to me.
>> >
>> > >> In Hadoop, every iteration must be scheduled as a separate job,
>> > >> rereads invariant data and materializes its result to hdfs.
>> > >> Therefore, iterative programs on Hadoop are an order of magnitude
>> > >> slower than on systems that have dedicated support for iterations.
>> > >>
>> > >> Does h2o help here, or would we need to incorporate another system
>> > >> for such tasks?
>> > >
>> > > H2o helps here in a couple of different ways.
>> > >
>> > > The first and foremost is that primitive operations are easy.
>> > > Additionally, data elements can survive a single program's
>> > > execution. This means that programs can be executed one after
>> > > another to get composite effects. This is astonishingly fast ...
>> > > more along the speeds one would expect from a single-processor
>> > > program.
>> >
>> > I think the problem here is that the authors keep comparing these
>> > techniques to the slowest model available, which is Hadoop.
>> >
>> > But this is the exact execution model of Spark. You get stuff
>> > repeatedly executed on in-memory partitions, and get approximately
>> > the speed of iterative in-memory execution. I won't describe it as
>> > astonishing, though, because indeed it is as fast as you can get
>> > things done in memory -- no faster, no slower. That's, for example,
>> > the reason why my linalg optimizer does not hesitate to compute exact
>> > matrix geometry lazily, if not known, for optimization purposes: the
>> > answer will be back within 40 to 200 ms, assuming adequate RAM
>> > allocation. I have been using these paradigms for more than a year
>> > now. This is all good stuff. I would not use the word astonishing,
>> > but sensible, yes. My main concern is whether the programming model
>> > is being sacrificed just to do sensible things here.
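>> >
>> > To make the comparison concrete: here is a minimal sketch of that
>> > load-once/iterate-many pattern, fitting the same y = beta0 + beta1*x
>> > line by gradient descent purely as an illustration (LR2 above gets
>> > the exact answer in 3 passes). It assumes a recent Spark Java API
>> > with Java 8 lambdas; the input path, learning rate and iteration
>> > count are made up:
>> >
>> >   import org.apache.spark.api.java.JavaRDD;
>> >   import org.apache.spark.api.java.JavaSparkContext;
>> >
>> >   public class GDSketch {
>> >     public static void main(String[] args) {
>> >       JavaSparkContext sc = new JavaSparkContext("local[*]", "gd-sketch");
>> >       // "x,y" rows: parsed once, then pinned in memory for all passes
>> >       JavaRDD<double[]> xy = sc.textFile("xy.csv")
>> >           .map(s -> { String[] t = s.split(",");
>> >                       return new double[]{ Double.parseDouble(t[0]),
>> >                                            Double.parseDouble(t[1]) }; })
>> >           .cache();
>> >       final long n = xy.count();
>> >       double b0 = 0, b1 = 0;
>> >       for (int i = 0; i < 50; i++) {   // each pass hits in-memory partitions only
>> >         final double a0 = b0, a1 = b1;
>> >         // gradient of 0.5 * mean squared residual w.r.t. (b0, b1)
>> >         double[] g = xy.map(p -> { double r = a0 + a1 * p[0] - p[1];
>> >                                    return new double[]{ r, r * p[0] }; })
>> >                        .reduce((u, v) -> new double[]{ u[0] + v[0], u[1] + v[1] });
>> >         b0 -= 0.01 * g[0] / n;
>> >         b1 -= 0.01 * g[1] / n;
>> >       }
>> >       System.out.println("beta0=" + b0 + " beta1=" + b1);
>> >       sc.stop();
>> >     }
>> >   }
>> >
>> > The point is not the algorithm; it is that after the single
>> > textFile()/cache(), all 50 passes run against memory, which is the
>> > "composite effects" behavior described above.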
>> >
>> > >> (2) Efficient join implementations
>> > >>
>> > >> If we look at a lot of Mahout's algorithm implementations with a
>> > >> database hat on, then we see lots of hand-coded joins in our
>> > >> codebase, because Hadoop does not bring join primitives. This has
>> > >> lots of drawbacks, e.g. it complicates the codebase and leads to
>> > >> hardcoded join strategies that bake certain assumptions into the
>> > >> code (e.g. ALS uses a broadcast-join, which assumes that one side
>> > >> fits into memory on each machine; RecommenderJob uses a
>> > >> repartition-join, which is scalable but very slow for small
>> > >> inputs, ...).
>> >
>> > +1
>> >
>> > > I think that h2o provides this, but I do not know in detail how. I
>> > > do know that many of the algorithms already coded make use of
>> > > matrix multiplication, which is essentially a join operation.
>> >
>> > Essentially a join? The Spark module optimizer picks from at least 3
>> > implementations: zip+combine, block-wise cartesian and, finally, yes,
>> > join+combine. Which one depends on the orientation and the earlier
>> > operators in the pipeline. That's exactly my point about the
>> > flexibility of the programming model from the optimizer's point of
>> > view.
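>> >
>> > For the record, the two hardcoded strategies called out above
>> > (ALS-style broadcast vs. RecommenderJob-style repartition) look
>> > roughly like this in Spark's Java API -- a hedged sketch only, with
>> > illustrative key/value types and names:
>> >
>> >   import java.util.Map;
>> >   import org.apache.spark.api.java.JavaPairRDD;
>> >   import org.apache.spark.api.java.JavaSparkContext;
>> >   import org.apache.spark.broadcast.Broadcast;
>> >   import scala.Tuple2;
>> >
>> >   public class JoinSketch {
>> >     static void demo(JavaSparkContext sc,
>> >                      JavaPairRDD<Long, Double> big,
>> >                      JavaPairRDD<Long, Double> small) {
>> >       // Repartition (shuffle) join: both sides are shuffled by key.
>> >       // Scales to two large inputs, but pays a full shuffle even
>> >       // when one side is tiny.
>> >       JavaPairRDD<Long, Tuple2<Double, Double>> shuffled = big.join(small);
>> >
>> >       // Broadcast join: replicate the small side to every worker and
>> >       // join map-side, with no shuffle. Bakes in the assumption that
>> >       // the small side fits in memory on each machine.
>> >       Broadcast<Map<Long, Double>> bc = sc.broadcast(small.collectAsMap());
>> >       JavaPairRDD<Long, Tuple2<Double, Double>> mapSide =
>> >           big.filter(kv -> bc.value().containsKey(kv._1()))
>> >              .mapToPair(kv -> new Tuple2<>(kv._1(),
>> >                  new Tuple2<>(kv._2(), bc.value().get(kv._1()))));
>> >     }
>> >   }
>> >
>> > An optimizer's job is exactly to pick between these (and the
>> > zip/cartesian variants) instead of baking the choice into each
>> > algorithm.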
>> >
>> > >> Obviously, I'd love to get rid of hand-coded joins and implement
>> > >> ML algorithms (which is hard enough on its own). Other systems
>> > >> help with this already. Spark, for example, offers broadcast and
>> > >> repartition-join primitives; Stratosphere has a join primitive and
>> > >> an optimizer that automatically decides which join strategy to
>> > >> use, as well as a highly optimized hybrid hash-join implementation
>> > >> that can gracefully go out-of-core under memory pressure.
>> >
>> > > When you get into the realm of things on this level of
>> > > sophistication, I think that you have found the boundary where
>> > > alternative foundations like Spark and Stratosphere are better than
>> > > h2o. The novelty with h2o is the hypothesis that a very large
>> > > fraction of interesting ML algorithms can be implemented without
>> > > this power. So far, this seems correct.
>> >
>> > Again, this is largely along the lines of "let's make a library of a
>> > few hand-optimized things". Which is noble but -- I would argue --
>> > not ambitious enough. Most of the distributed ML projects do just
>> > that. We should perhaps think along the lines of what could be a
>> > differentiating factor for us.
>> >
>> > Not that we should not care about performance. It should be, of
>> > course, *sensible*. (Our MR code base of course does not give us
>> > that; as you said, jumping off the MR wagon is not even a question.)
>> >
>> > If you can forgive me for drawing parallels here, it's the difference
>> > between something like Weka and R: a collection vs. a platform _and_
>> > a collection induced by the platform. The platform of course also
>> > feeds directly into the speed of collection growth.
>> >
>> > When I use R, my code does not consist of algorithm calls. Yes, there
>> > is off-the-shelf use now and then, but that is far from the only
>> > thing going on; 95% of it is simple feature massaging. I place no
>> > value in R for providing GLM for me. Gosh, this particular offering
>> > is virtually hanging from everywhere these days.
>> >
>> > But I do place value in it for doing custom feature prep and, for
>> > example, for being able to get 100 grad students to try their own
>> > k-means implementations in seconds.
>> >
>> > Why?
>> >
>> > There has been a lot of talk here about building community,
>> > contributions, etc. A platform is what builds those, most directly
>> > and most amazingly. I would go out on a limb here and say that Spark
>> > and MLlib are experiencing explosive growth of contributions not
>> > because they can do things with in-memory datasets (which is
>> > important but, like I said, has long since been viewed as no more
>> > than just sensible), but because of the clarity of the programming
>> > model. I think we have seen very solid evidence that the clarity and
>> > richness of the programming model is the thing that attracts
>> > communities.
>> >
>> > If we grade roughly (very roughly!) what we have today, I can easily
>> > argue that acceptance levels follow the programming model very
>> > closely. E.g., if I sort projects with distributed programming models
>> > by (my subjectively perceived) popularity, from bottom to top:
>> >
>> > ********
>> >
>> > Hadoop MapReduce -- ok, I don't even know how to organize the
>> > critique here; too long of a list. Almost nobody (but Mahout) does
>> > things this way today. Certainly neither of my last 2 employers did.
>> >
>> > Hive -- SQL-like, with severely constrained general programming
>> > language capabilities; not conducive to batches. Pretty much limited
>> > to ad-hoc exploration.
>> >
>> > Pig -- a bit better; one can write batches, but extra functionality
>> > mixins (UDFs) are still a royal pain.
>> >
>> > Cascading -- even easier: rich primitives, easy batches, some manual
>> > optimization of physical plan elements. One of the big cons is the
>> > limitation of a rigid dataset tuple structure.
>> >
>> > FlumeJava (Crunch in the Apache world) -- even better, but Java
>> > closures are just plain ugly: zero "scriptability". Its community was
>> > hurt a little by the fact that it was a bit late to the show compared
>> > to others (e.g. Cascading), but it leveled off quickly.
>> >
>> > Scala bindings for Cascading (Scalding) and for FlumeJava -- better;
>> > hell, well better on the closure and FP front! But still, not being
>> > native to Scala from the get-go creates some miniature problems
>> > there.
>> >
>> > Spark -- I think it is fair to say the current community "king" above
>> > all of those: all the aforementioned platform-model pains are
>> > eliminated, although on the performance side I think there are still
>> > some pockets for improvement in cost-based optimization.
>> >
>> > Stratosphere might be more interesting in this department, but I am
>> > not sure at this point whether that will necessarily translate into
>> > performance benefits for ML.
>> >
>> > ********
>> >
>> > The first few of these use the same computing model underneath and
>> > have essentially the same performance. Yet there's clear variation in
>> > community and acceptance.
>> >
>> > In the ML world, we are seeing approximately the same thing. The
>> > clearer the programming model and the easier the integration into the
>> > process, the wider the acceptance. I can probably quite successfully
>> > argue that the most performant ML "thing" as it stands today is
>> > GraphLab. And it is pretty comprehensive in problem coverage (I think
>> > it covers, e.g., recommender concerns better than h2o and Mahout
>> > together). But I can also quite successfully argue that it is
>> > rejected a lot of the time for being just a collection (which, in
>> > addition, is hard to call from the JVM -- i.e., integration again).
>> > It is actually so bad that people in my company would rather go back
>> > to 20 SNOW-wired R servers than even entertain an architecture
>> > including a GraphLab component. (Yes, the variance of this sample is
>> > as high as it gets; I am just saying what I hear.)
>> >
>> > So, as a general guideline to solve the current ills, it would stand
>> > to reason to adopt platform priority, with the algorithm collection
>> > as a function of that platform, rather than a collection as a
>> > function of a few dedicated efforts. Yes -- it has to be *sensibly*
>> > performant -- but this does not have to be mostly a concern of the
>> > code in this project directly. Rather, it has to be a concern of the
>> > backs (i.e. dependencies) and of our in-core support.
>> >
>> > Our pathological fear of being a performance scapegoat totally
>> > obscures the fact that performance is mostly a function of the back,
>> > and that we were riding the wrong back for a long time. As long as we
>> > don't cling to a particular back, it shouldn't be a problem. What
>> > would one rather accept: being initially 5x slower than GraphLab (but
>> > on par with MLlib) and beating them on community support, or being on
>> > par but anemic in community? If the 0xdata platform feels performance
>> > has been so important as to sacrifice the programming model, why do
>> > they feel the need to join an Apache project? After all, they have
>> > been an open project for a long time already and have built their own
>> > community, big or small. Spark has just now become a top-level Apache
>> > project, joined the Apache incubator a mere 2 months ago, and did not
>> > have any trouble attracting a community outside Apache at all.
>> > Stratosphere is not even in Apache. Similarly, did being in Apache
>> > help Mahout get anywhere close to these in community measurement? So
>> > this totally refutes the argument that one has to be an Apache
>> > project to get its exclusive qualities highlighted. Perhaps in the
>> > end it is more about the importance of the qualities to the
>> > community, and the quality of contributions.
>> >
>> > A lot of this platform-and-programming-model priority is probably
>> > easier said than done, but some of the linalg and data frame things
>> > are ridiculously easy in terms of the amount of effort. If I could do
>> > a linalg optimizer with bindings for Spark on 2 nights a month, the
>> > same can be done for multiple backs and data frames in a jiffy. Well,
>> > the back should of course have a clear programming model as a
>> > prerequisite. Which brings us back to the issue of the richness of
>> > distributed primitives.
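>> >
>> > To make "collection as a function of platform" concrete, a purely
>> > hypothetical Java sketch -- nothing like this exists; it only
>> > illustrates the shape of a back-neutral surface that each back
>> > (Spark, H2O, Stratosphere, ...) could implement with its own
>> > optimizer behind it:
>> >
>> >   // Hypothetical back-neutral distributed matrix handle.
>> >   public interface DistributedMatrix {
>> >     DistributedMatrix transpose();
>> >     // The back's optimizer picks the physical plan here:
>> >     // zip+combine, block-wise cartesian, join+combine, broadcast, ...
>> >     DistributedMatrix times(DistributedMatrix other);
>> >     double[][] collect();   // materialize in core memory
>> >   }
>> >
>> > An algorithm written once against such a surface, e.g.
>> > a.transpose().times(a) for A'A, would then run unchanged on any back.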
>>
>> --
>> ceo & co-founder, 0xdata Inc <http://www.0xdata.com/>
>> +1-408.316.8192
