Isn’t there some work on RSJ on Spark? Can we compare that to something 0xdata 
can “knock off”?


On Mar 14, 2014, at 10:08 AM, Sebastian Schelter <[email protected]> wrote:

Dmitriy,

I share a lot of the concerns you expressed here. I hear more complaints about 
Mahout being too inaccessible and too hard to customize for particular use cases 
and inputs than about it being too slow. I also concur with your analysis that 
the clear and accessible programming model is what drives Spark's popularity.

I'm also not a fan of sacrificing the programming model for performance; I also 
consider this the main drawback of GraphLab. It's super fast for a certain set of 
problems, but it constrains you to a vertex-centric programming model, into 
which a lot of things hardly fit.



On 03/14/2014 03:21 PM, Dmitriy Lyubimov wrote:
>> I think that the proposal under discussion involves adding a dependency on
>> a maven-released h2o artifact plus a contribution of Mahout translation
>> layers.  These layers would give subclasses of Matrix (and Vector) which
>> allow direct control over life span across multiple jobs but would
>> otherwise behave like their in-memory counterparts.
> 
> Well, I suppose that means they have to live in some processes which are not
> processes I already have. And they have to be managed. So this is not just
> an in-core subsystem. Sounds like a new backend to me.
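> 
> (To picture what that implies -- a purely hypothetical Scala sketch, not
> h2o's actual API: any such matrix has to expose explicit lifecycle control
> over its externally held backing data, i.e. something has to manage those
> processes.)
> 
>   import org.apache.mahout.math.Matrix
> 
>   // Hypothetical: a Matrix whose backing storage lives in an external
>   // store (e.g. an h2o cloud) and can outlive a single job's JVM.
>   trait ExternallyBackedMatrix extends Matrix {
>     def retain(key: String): Unit  // pin the backing frame under a key
>     def release(): Unit            // free it once no longer referenced
>   }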
> 
>>> 
>>> In Hadoop, every iteration must be scheduled as a separate job, rereads
>>> invariant data and materializes its result to hdfs. Therefore, iterative
>>> programs on Hadoop are an order of magnitude slower than on systems that
>>> have dedicated support for iterations.
>>> 
>>> Does h2o help here or would we need to incorporate another system for
> such
>>> tasks?
>>> 
>> 
>> H2o helps here in a couple of different ways.
>> 
>> The first and foremost is that primitive operations are easy.
>> Additionally, data elements can survive a single program's execution.  This
>> means that programs can be executed one after another to get composite
>> effects.  This is astonishingly fast ... more along the speeds one would
>> expect from a single-processor program.
> 
> I think the problem here is that the authors keep comparing these
> techniques to the slowest model available, which is Hadoop MapReduce.
> 
> But this is exactly the execution model of Spark. You get stuff repeatedly
> executed on in-memory partitions and get approximately the speed of
> in-memory iterative execution.  I won't describe it as astonishing, though,
> because indeed it is as fast as you can get things done in memory, no
> faster, no slower. That's, for example, why my linalg optimizer does not
> hesitate to lazily compute exact matrix geometry when it is not known, for
> optimization purposes: the answer comes back within 40 to 200 ms, assuming
> adequate RAM allocation. I have been using these paradigms for more than a
> year now. This is all good stuff. I would not use the word astonishing,
> but sensible, yes. My main concern is whether the programming model is to
> be sacrificed just to do sensible things here.
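> 
> (For the record, a minimal Spark sketch of what that looks like in
> practice -- the path and sizes below are made up:)
> 
>   import org.apache.spark.SparkContext
> 
>   val sc = new SparkContext("local[4]", "cached-iteration-sketch")
> 
>   // Read the invariant input once and pin its partitions in memory.
>   val data = sc.textFile("hdfs:///tmp/A.csv")
>     .map(_.split(",").map(_.toDouble))
>     .cache()
> 
>   // Only the first action touches HDFS; every subsequent pass runs
>   // against the in-memory partitions at roughly in-core speed.
>   val rows  = data.count()
>   val norms = data.map(r => math.sqrt(r.map(x => x * x).sum)).collect()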
> 
>> 
> 
>>> (2) Efficient join implementations
>>> 
>>> If we look at a lot of Mahout's algorithm implementations with a
>>> database hat on, then we see lots of handcoded joins in our codebase,
>>> because Hadoop does not bring join primitives. This has lots of
>>> drawbacks, e.g. it complicates the codebase and leads to hardcoded join
>>> strategies that bake certain assumptions into the code (e.g. ALS uses a
>>> broadcast-join which assumes that one side fits into memory on each
>>> machine, RecommenderJob uses a repartition-join which is scalable but
>>> very slow for small inputs, ...).
>>> 
> 
> +1
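> 
> (To make the two strategies concrete, a rough Spark sketch -- names and
> data are made up: a broadcast join for when one side fits in memory, and a
> repartition/shuffle join otherwise.)
> 
>   import org.apache.spark.SparkContext
>   import org.apache.spark.SparkContext._
>   import org.apache.spark.rdd.RDD
> 
>   val sc = new SparkContext("local[4]", "join-sketch")
> 
>   val ratings: RDD[(Int, Double)] = sc.parallelize(Seq((1, 4.0), (2, 3.5)))
>   val items:   RDD[(Int, String)] = sc.parallelize(Seq((1, "a"), (2, "b")))
> 
>   // Broadcast join: assumes the item side fits in memory on every worker
>   // (the ALS-style assumption).
>   val itemMap = sc.broadcast(items.collectAsMap())
>   val joined1 = ratings.map { case (id, r) => (id, (r, itemMap.value.get(id))) }
> 
>   // Repartition (shuffle) join: scalable, but pays a full shuffle even for
>   // tiny inputs (the RecommenderJob trade-off).
>   val joined2 = ratings.join(items)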
> 
>> I think that h2o provides this but do not know in detail how.  I do know
>> that many of the algorithms already coded make use of matrix
>> multiplication, which is essentially a join operation.
> 
> Essentially a join? The spark module optimizer picks from at least three
> implementations: zip+combine, block-wise cartesian and, finally, yes,
> join+combine. It depends on the orientation and the earlier operators in
> the pipeline. That's exactly my point about the flexibility of the
> programming model from the optimizer's point of view.
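> 
> (E.g., the join+combine variant of A'B over row-keyed matrices is itself
> only a few lines in Spark -- a sketch, assuming small dense rows so the
> product can be reduced on the front end; the helper name is made up:)
> 
>   import org.apache.spark.SparkContext._
>   import org.apache.spark.rdd.RDD
> 
>   // A and B are row-keyed: (row index, dense row). A'B is the sum of the
>   // outer products of co-indexed rows, i.e. a join on the row key followed
>   // by a combine.
>   def atb(a: RDD[(Long, Array[Double])],
>           b: RDD[(Long, Array[Double])]): Array[Array[Double]] =
>     a.join(b)
>      .map { case (_, (aRow, bRow)) =>
>        Array.tabulate(aRow.length, bRow.length)((i, j) => aRow(i) * bRow(j)) }
>      .reduce((m1, m2) => m1.zip(m2).map { case (r1, r2) =>
>        r1.zip(r2).map { case (x, y) => x + y } })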
> 
>> 
>>> Obviously, I'd love to get rid of handcoded joins and implement ML
>>> algorithms (which is hard enough on its own). Other systems help with
>>> this already. Spark, for example, offers broadcast and repartition-join
>>> primitives; Stratosphere has a join primitive and an optimizer that
>>> automatically decides which join strategy to use, as well as a highly
>>> optimized hybrid hash-join implementation that can gracefully go
>>> out-of-core under memory pressure.
>>> 
>> 
>> When you get into the realm of things on this level of sophistication, I
>> think that you have found the boundary where alternative foundations like
>> Spark and Stratosphere are better than h2o.  The novelty with h2o is the
>> hypothesis that a very large fraction of interesting ML algorithms can be
>> implemented without this power.  So far, this seems correct.
> 
> Again, this is largely along the lines of "let's make a library of a few
> hand-optimized things". Which is noble, but -- I would argue -- not
> ambitious enough. Most of the distributed ML projects do just that. We
> should perhaps think about what could be a differentiating factor for us.
> 
> Not that we should not care about performance. It should be, of course,
> *sensible*. (Our MR code base of course does not give us that; as you said,
> jumping off the MR wagon is not even a question.)
> 
> If you can forgive me for drawing parallels here, it's the difference
> between something like Weka and R. Collection vs. platform _and_ collection
> induced by platform. A platform of course also feeds directly into the
> speed at which the collection grows.
> 
> When I use R, my code does not consist solely of algorithm calls. That is,
> yes, there is off-the-shelf use now and then, but that is far from the only
> thing going on. 95% of it is simple feature massaging. I place no value in
> R for providing GLM for me. Gosh, this particular offering is available
> virtually everywhere these days.
> 
> But I do place value in it for doing custom feature prep and, for example,
> for being able to get 100 grad students to try their own k-means
> implementation in seconds (see the sketch below).
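> 
> (That is not an exaggeration on a platform with a clear programming model.
> A throwaway Spark sketch of Lloyd iterations -- assumptions: dense points,
> squared Euclidean distance, function name made up:)
> 
>   import org.apache.spark.SparkContext._
>   import org.apache.spark.rdd.RDD
> 
>   def kmeans(points: RDD[Array[Double]], k: Int, iters: Int): Array[Array[Double]] = {
>     var centers = points.takeSample(false, k, 42)
>     for (_ <- 1 to iters) {
>       centers = points
>         .map { p =>
>           // assign each point to its nearest current center
>           val best = centers.indices.minBy { i =>
>             centers(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum }
>           (best, (p, 1L))
>         }
>         .reduceByKey { case ((p1, n1), (p2, n2)) =>
>           (p1.zip(p2).map { case (x, y) => x + y }, n1 + n2) }
>         .map { case (_, (sum, n)) => sum.map(_ / n) }
>         .collect()
>     }
>     centers
>   }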
> 
> Why?
> 
> There has been a lot of talk here about building community, contributions,
> etc. A platform is what builds it, most directly and amazingly. I would go
> out on a limb here and say that Spark and MLlib are experiencing explosive
> growth of contributions not because they can do things with in-memory
> datasets (which is important, but, like I said, has long since been viewed
> as no more than just sensible), but because of the clarity of the
> programming model. I think we have seen very solid evidence that the
> clarity and richness of the programming model is what attracts communities.
> 
> If we grade roughly (very roughly!) what we have today, I can easily argue
> that the acceptance levels follow the programming model very closely. E.g.,
> if I try to sort projects with distributed programming models by (my
> subjectively perceived) popularity, from bottom to top:
> 
> ********
> 
> Hadoop MapReduce -- OK, I don't even know how to organize the critique
> here; the list is too long. Almost nobody (but Mahout) does these things
> this way today. Certainly, neither of my last 2 employers did.
> 
> Hive -- SQL-like, with severely constrained general programming language
> capabilities, not conducive to batches. Pretty much limited to ad-hoc
> exploration.
> 
> Pig -- a bit better, can write batches, but extra functionality mixins
> (UDFs) are still a royal pain.
> 
> Cascading -- even easier, rich primitives, easy batches, some manual
> optimization of physical plan elements. One of the big cons is the
> limitation of a rigid dataset tuple structure.
> 
> FlumeJava (Crunch in the Apache world) -- even better, but Java closures
> are just plain ugly, zero "scriptability". Its community has been hurt a
> little because it was a bit late to the show compared to others (e.g.
> Cascading), but it leveled off quickly.
> 
> Scala bindings for Cascading (Scalding) and FlumeJava -- better; hell,
> much better on the closure and FP front! But not being native to Scala
> from the get-go still creates some minor problems there.
> 
> Spark -- I think it is fair to say, the current community "king" above all
> of these -- all the aforementioned platform-model pains are eliminated,
> although on the performance side I think there are still some pockets for
> improvement around cost-based optimization.
> 
> Stratosphere might be more interesting in this department, but I am not
> sure at this point whether that will necessarily translate into
> performance benefits for ML.
> 
> ********
> 
> The first few things use the same computing model underneath and
> essentially have roughly the same performance. Yet there's clear variation
> in community and acceptance.
> 
> In the ML world, we are seeing approximately the same thing. The clearer
> the programming model and the easier the integration into the process, the
> wider the acceptance. I can probably argue pretty successfully that the
> most performant ML "thing" as it stands today is GraphLab. And it is pretty
> comprehensive in problem coverage (I think it covers, for example,
> recommender concerns better than h2o and Mahout together). But I can also
> argue pretty successfully that it gets rejected a lot of the time for being
> just a collection (which, in addition, is hard to call from the JVM, i.e.
> integration again). It is actually so bad that people in my company would
> rather go back to 20 snow-wired R servers than even entertain an
> architecture that includes a GraphLab component. (Yes, the variance of this
> sample is as high as it gets; I'm just saying what I hear.)
> 
> So as a general guideline to solve the current ills, it would stand to
> reason to adopt platform priority, with the algorithm collection as a
> function of that platform, rather than a collection built by a few
> dedicated efforts. Yes -- it has to be *sensibly* performant -- but that
> does not have to be primarily a concern of the code in this project
> directly. Rather, it has to be a concern of the backends (i.e.
> dependencies) and our in-core support.
> 
> Our pathological fear of being a performance scapegoat totally obscures
> the fact that performance is mostly a function of the backend, and that we
> were riding on the wrong backend for a long time. As long as we don't cling
> to a particular backend, it shouldn't be a problem. What would one rather
> accept: being initially 5x slower than GraphLab (but on par with MLlib) and
> beating these on community support, or being on par but anemic in
> community? If the 0xdata platform feels performance has been important
> enough to sacrifice the programming model, why do they feel the need to
> join an Apache project? After all, they have been an open project for a
> long time already and have built their own community, big or small. Spark
> has just now become a top-level Apache project, having joined the Apache
> incubator a mere 2 months ago, and it did not have any trouble attracting a
> community outside Apache at all. Stratosphere is not even in Apache.
> Similarly, did being in Apache help Mahout get anywhere close to these in
> community measurement? So this totally refutes the argument that one has to
> be an Apache project to get one's exclusive qualities highlighted. Perhaps
> in the end it is more about how important those qualities are to the
> community, and the quality of contributions.
> 
> A lot of this platform and programming-model priority is probably easier
> said than done, but some of the linalg and data frame things are
> ridiculously easy in terms of the amount of effort. If I could do a linalg
> optimizer with bindings for Spark in 2 nights a month, the same can be done
> for multiple backends and data frames in a jiffy. Well, as a prerequisite,
> the backend of course needs to have a clear programming model. Which brings
> us back to the issue of the richness of distributed primitives.
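> 
> (To illustrate how small the user-facing part of such bindings needs to
> be -- a hypothetical mini-DSL sketch in Scala, not the actual spark
> bindings API: expressions merely build a logical plan, and a
> backend-specific planner later decides whether to run, say, a product as
> zip+combine, block-wise cartesian, or join+combine.)
> 
>   import org.apache.spark.rdd.RDD
> 
>   object DslSketch {
>     // logical plan nodes over distributed row matrices
>     sealed trait DrmExpr
>     case class Leaf(rows: RDD[(Long, Array[Double])]) extends DrmExpr
>     case class Transpose(a: DrmExpr) extends DrmExpr
>     case class Times(a: DrmExpr, b: DrmExpr) extends DrmExpr
> 
>     // R-like operators; nothing is computed until a planner runs the plan
>     implicit class Ops(a: DrmExpr) {
>       def t: DrmExpr = Transpose(a)
>       def %*%(b: DrmExpr): DrmExpr = Times(a, b)
>     }
>   }
> 
> With that in scope, `A.t %*% A` is just a plan the optimizer gets to look
> at before anything runs.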
> 

