Isn’t there some work on RSJ (RowSimilarityJob) on Spark? Can we compare that to something 0xdata can “knock off”?
On Mar 14, 2014, at 10:08 AM, Sebastian Schelter <[email protected]> wrote:

Dmitriy, I share a lot of your concerns expressed here. I hear more complaints about Mahout being too inaccessible and too hard to customize for use cases and inputs than complaints about it being too slow. I also concur with your analysis that the clear and accessible programming model is what causes Spark's popularity. I'm not a fan of sacrificing a programming model for performance either; I also consider this the main drawback of GraphLab. It's super fast for a certain set of problems, but it constrains you to a vertex-centric programming model into which a lot of things hardly fit.

On 03/14/2014 03:21 PM, Dmitriy Lyubimov wrote:

>> I think that the proposal under discussion involves adding a dependency on a maven-released h2o artifact plus a contribution of Mahout translation layers. These layers would give sub-classes of Matrix (and Vector) which allow direct control over lifespan across multiple jobs but would otherwise behave like their in-memory counterparts.
>
> Well, I suppose that means they have to live in some processes which are not processes I already have. And they have to be managed. So this is not just an in-core subsystem. Sounds like a new back to me.
>
>>> In Hadoop, every iteration must be scheduled as a separate job, rereads invariant data and materializes its result to HDFS. Therefore, iterative programs on Hadoop are an order of magnitude slower than on systems that have dedicated support for iterations.
>>>
>>> Does h2o help here or would we need to incorporate another system for such tasks?
>>
>> H2o helps here in a couple of different ways.
>>
>> The first and foremost is that primitive operations are easy. Additionally, data elements can survive a single program's execution. This means that programs can be executed one after another to get composite effects. This is astonishingly fast ... more along the speeds one would expect from a single-processor program.
>
> I think the problem here is that the authors keep comparing these techniques to the slowest model available, which is Hadoop.
>
> But this is exactly the execution model of Spark. You get stuff repeatedly executed on in-memory partitions and get approximately iterative execution speed. I won't describe it as astonishing, though, because indeed it is as fast as you can get things done in memory, no faster, no slower. That's, for example, the reason why my linalg optimizer does not hesitate to compute exact matrix geometry lazily, if not known, for optimization purposes: the answer will be back within 40 to 200 ms, assuming adequate RAM allocation. I have been using these paradigms for more than a year now. This is all good stuff. I would not use the word astonishing, but sensible, yes. My main concern is whether the programming model is to be sacrificed just to do sensible things here.
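For concreteness, here is a minimal sketch of the pattern described above, written against plain Spark. The type alias, input path and field layout are illustrative assumptions, not the actual spark-bindings API: once the partitions of a distributed matrix are cached, finding its exact geometry is just one more small in-memory job rather than a rescheduled batch round.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object LazyGeometry {
  // A distributed row-wise matrix held as (rowIndex, rowValues) pairs.
  type DistRowMatrix = RDD[(Long, Array[Double])]

  // Exact geometry computed as two small jobs over the cached partitions.
  def geometry(drm: DistRowMatrix): (Long, Int) = {
    val nrow = drm.keys.reduce(math.max) + 1
    val ncol = drm.values.map(_.length).reduce(math.max)
    (nrow, ncol)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "lazy-geometry")
    val drm: DistRowMatrix = sc.textFile("hdfs:///data/matrix.tsv")  // hypothetical input
      .map { line =>
        val f = line.split('\t')
        (f(0).toLong, f.drop(1).map(_.toDouble))
      }
      .cache()
    drm.count()              // first pass loads and pins the partitions in memory
    println(geometry(drm))   // later passes hit memory, not HDFS
    sc.stop()
  }
}
```

Whether such a pass actually comes back in the 40 to 200 ms range quoted above obviously depends on cluster size, partition count and RAM, but the structural point stands: each pass is a short job against memory, not a separately scheduled job that rereads its input.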
>>> (2) Efficient join implementations
>>>
>>> If we look at a lot of Mahout's algorithm implementations with a database hat on, then we see lots of hand-coded joins in our codebase, because Hadoop does not bring join primitives. This has lots of drawbacks, e.g. it complicates the codebase and leads to hardcoded join strategies that bake certain assumptions into the code (e.g. ALS uses a broadcast-join which assumes that one side fits into memory on each machine, RecommenderJob uses a repartition-join which is scalable but very slow for small inputs, ...).
>
> +1
>
>> I think that h2o provides this but I do not know in detail how. I do know that many of the algorithms already coded make use of matrix multiplication, which is essentially a join operation.
>
> Essentially a join? The Spark module optimizer picks from at least 3 implementations: zip+combine, block-wise cartesian and, finally, yes, join+combine. It depends on orientation and the earlier operators in the pipeline. That's exactly my point about the flexibility of the programming model from the optimizer's point of view.
>
>>> Obviously, I'd love to get rid of hand-coded joins and implement ML algorithms (which is hard enough on its own). Other systems already help with this. Spark, for example, offers broadcast- and repartition-join primitives; Stratosphere has a join primitive and an optimizer that automatically decides which join strategy to use, as well as a highly optimized hybrid hash-join implementation that can gracefully go out-of-core under memory pressure.
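For reference, here is roughly what the two join strategies named above look like when expressed against plain Spark primitives instead of being hand-coded in MapReduce. The types, names and the ratings/features framing are hypothetical; this sketches the pattern, not actual ALS or RecommenderJob code.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object JoinStrategies {

  // Repartition (shuffle) join: always scales, but shuffles both sides even
  // when one of them is tiny -- the "slow for small inputs" case above.
  def repartitionJoin(ratings: RDD[(Int, Double)],
                      features: RDD[(Int, Array[Double])]): RDD[(Int, (Double, Array[Double]))] =
    ratings.join(features)

  // Broadcast (map-side) join: ships the small side to every worker once and
  // avoids the shuffle entirely -- the "one side fits in memory" assumption.
  def broadcastJoin(sc: SparkContext,
                    ratings: RDD[(Int, Double)],
                    features: Map[Int, Array[Double]]): RDD[(Int, (Double, Array[Double]))] = {
    val small = sc.broadcast(features)
    ratings.flatMap { case (item, rating) =>
      small.value.get(item).map(f => (item, (rating, f)))
    }
  }
}
```

With primitives like these (or an optimizer on top of them), the choice between the two strategies becomes a planning decision driven by input sizes rather than an assumption baked into each algorithm's code.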
>> When you get into the realm of things on this level of sophistication, I think that you have found the boundary where alternative foundations like Spark and Stratosphere are better than h2o. The novelty with h2o is the hypothesis that a very large fraction of interesting ML algorithms can be implemented without this power. So far, this seems correct.
>
> Again, this is largely along the lines of "let's make a library of a few hand-optimized things". Which is noble, but -- I would argue -- not ambitious enough. Most of the distributed ML projects do just that. We should perhaps think about what could be a differentiating factor for us.
>
> Not that we should not care about performance. It should be, of course, *sensible*. (Our MR code base of course does not give us that; as you said, jumping off the MR wagon is not even a question.)
>
> If you can forgive me for drawing parallels here, it's the difference between something like Weka and R. Collection vs. platform _and_ collection induced by platform. A platform of course also directly feeds into the speed of collection growth.
>
> When I use R, I don't have code consisting of algorithm calls. That is, yes, there is off-the-shelf use now and then, but that is far from the only thing going on. 95% of the things are simple feature massaging. I place no value in R for providing GLM for me. Gosh, this particular offering is available virtually everywhere these days.
>
> But I do place value in it for doing custom feature prep and, for example, for being able to get 100 grad students to try their own k-means implementations in seconds.
>
> Why?
>
> There has been a lot of talk here about building community, contributions etc. A platform is what builds these, most directly and amazingly. I would go out on a limb here and say that Spark and MLlib are experiencing explosive growth of contributions not because they can do things with in-memory datasets (which is important, but like I said, has long since been viewed as no more than just sensible), but because of the clarity of the programming model. I think we have seen very solid evidence that the clarity and richness of the programming model is the thing that attracts communities.
>
> If we grade roughly (very roughly!) what we have today, I can easily argue that the acceptance levels follow the programming model very closely, e.g. if I try to sort projects with distributed programming models by (my subjectively perceived) popularity, from bottom to top:
>
> ********
>
> Hadoop MapReduce -- OK, I don't even know how to organize the critique here, too long a list; almost nobody (but Mahout) does these things this way today. Certainly, neither of my last 2 employers did.
>
> Hive -- SQL-like with severely constrained general programming language capabilities, not conducive to batches. Pretty much limited to ad-hoc exploration.
>
> Pig -- a bit better, can write batches, but extra functionality mixins (UDFs) are still a royal pain.
>
> Cascading -- even easier, rich primitives, easy batches, some manual optimization of physical plan elements. One of the big cons is the limitation of a rigid dataset tuple structure.
>
> FlumeJava (Crunch in the Apache world) -- even better, but Java closures are just plain ugly, zero "scriptability". Its community has been hurt a little bit by the fact that it was a bit late to the show compared to others (e.g. Cascading), but it leveled off quickly.
>
> Scala bindings for Cascading (Scalding) and FlumeJava -- better, hell, way better on the closure and FP front! But still, not being native to Scala from the get-go creates some minor problems there.
>
> Spark -- I think it is fair to say the current community "king" above them all -- all the aforementioned platform model pains are eliminated, although on the performance side I think there are still some pockets for improvement around cost-based optimization.
>
> Stratosphere might be more interesting in this department, but I am not sure at this point whether that will necessarily translate into performance benefits for ML.
>
> ********
>
> The first few use the same computing model underneath and essentially have roughly the same performance. Yet there is clear variation in community and acceptance.
>
> In the ML world, we are seeing approximately the same thing. The clearer the programming model and the easier the integration into the process, the wider the acceptance. I can probably pretty successfully argue that the most performant ML "thing" as it stands is GraphLab. And it is pretty comprehensive in problem coverage (I think it covers recommender concerns better than h2o and Mahout together, for example). But I can also pretty successfully argue that it gets rejected a lot of the time for being just a collection (which, in addition, is hard to call from the JVM, i.e. integration again). It is actually so bad that people in my company would rather go back to 20 snow-wired R servers than even entertain an architecture including a GraphLab component. (Yes, the variance of this sample is as high as it gets; I'm just saying what I hear.)
>
> So, as a general guideline to solve the current ills, it would stand to reason to adopt platform priority, with the algorithm collection as a function of such a platform, rather than a collection as a function of a few dedicated efforts. Yes -- it has to be *sensibly* performant -- but this does not have to be mostly a concern of the code in this project directly. Rather, it has to be a concern of the backs (i.e. dependencies) and our in-core support.
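To make the "platform first" point (and the earlier remark about 100 grad students writing their own k-means in seconds) concrete, here is a rough, toy sketch of a Lloyd iteration written directly against Spark primitives. The input path, k, iteration count and the dense-array representation are arbitrary illustrative choices, not a proposed Mahout implementation.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object ToyKMeans {
  type Point = Array[Double]

  def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def closest(p: Point, centroids: Array[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "toy-kmeans")

    // Hypothetical input: one comma-separated point per line, cached once.
    val points = sc.textFile("hdfs:///data/points.csv")
      .map(_.split(',').map(_.toDouble))
      .cache()

    var centroids: Array[Point] = points.takeSample(false, 5, 42)
    for (_ <- 1 to 10) {
      // Assign each point to its closest centroid, then recompute the means.
      centroids = points
        .map(p => (closest(p, centroids), (p, 1)))
        .reduceByKey { case ((s1, n1), (s2, n2)) =>
          (s1.zip(s2).map { case (x, y) => x + y }, n1 + n2)
        }
        .map { case (_, (sum, n)) => sum.map(_ / n) }
        .collect()
    }
    centroids.foreach(c => println(c.mkString(",")))
    sc.stop()
  }
}
```

Nothing here is tuned or production-grade; the point is only that with a clear set of distributed primitives, this kind of custom experiment is a page of code rather than a pipeline of hand-written MapReduce jobs.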
> Our pathological fear of being a performance scapegoat totally obscures the fact that performance is mostly a function of the back, and that we were riding on the wrong back for a long time. As long as we don't cling to a particular back, it shouldn't be a problem. What would one rather accept: being initially 5x slower than GraphLab (but on par with MLlib) while beating these on community support, or being on par but anemic in community? If the 0xdata platform feels that performance has been important enough to sacrifice the programming model, why do they feel the need to join an Apache project? After all, they have been an open project for a long time already and have built their own community, big or small. Spark has just now become a top-level Apache project, joined the Apache incubator a mere 2 months ago, and did not have any trouble attracting a community outside Apache at all. Stratosphere is not even in Apache. Similarly, did it help Mahout to be in Apache to get anywhere close to these in community measures? So this totally refutes the argument that one has to be an Apache project to get its exclusive qualities highlighted. Perhaps in the end it is more about how important the qualities are to the community, and the quality of contributions.
>
> A lot of this platform and programming model priority is probably easier said than done, but some of the linalg and data frame things are ridiculously easy in terms of the amount of effort. If I could do a linalg optimizer with bindings for Spark on 2 nights a month, the same can be done for multiple backs and data frames in a jiffy. Well, the back should of course have a clear programming model as a prerequisite. Which brings us back to the issue of richness of distributed primitives.
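On the "ridiculously easy in terms of effort" claim, here is a sketch of how thin one distributed linalg building block can be on a back with rich primitives: X'X for a tall, skinny row-partitioned matrix, computed as a sum of small per-partition partial products. The names are illustrative and this is not the actual spark-bindings implementation; an optimizer of the kind described above would simply sit on a handful of such physical operators and choose among them based on partitioning and orientation.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ThinLinalg {
  type Row = Array[Double]

  // X'X for a tall, skinny matrix: each partition accumulates its own
  // ncol x ncol partial sum of outer products, and only those small
  // partials travel back to the driver to be combined.
  def ata(x: RDD[Row], ncol: Int): Array[Array[Double]] =
    x.mapPartitions { rows =>
      val acc = Array.ofDim[Double](ncol, ncol)
      rows.foreach { r =>
        var i = 0
        while (i < ncol) {
          var j = 0
          while (j < ncol) { acc(i)(j) += r(i) * r(j); j += 1 }
          i += 1
        }
      }
      Iterator.single(acc)
    }.reduce { (a, b) =>
      for (i <- 0 until ncol; j <- 0 until ncol) a(i)(j) += b(i)(j)
      a
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "thin-linalg")
    val x = sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0)))
    ata(x, ncol = 2).foreach(row => println(row.mkString(" ")))
    sc.stop()
  }
}
```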
