Ted,
> This isn't going to fix what is wrong with numerical computing in Mahout
> because the problems are different. (To my mind, the key problems for
> numerical computing include:
>
> a) efficient, very fine-grained parallelism (think microseconds)
> b) efficient in-memory mutable storage
> c) no serialization of data between steps
>
> These problems are not even addressed by most data-flow architectures (...)

-1. Points (b) and (c) are directly addressed by Spark and Stratosphere. All partitions are mutable, not only between fused operands but also between different pipelines, if you instruct the engine to do so. There's no serialization happening if the physical operator instructs the block manager to keep blocks deserialized (and, as it happens, that's exactly what it instructs by default).

My implementation of, say, elementwise A * B or 5.0 * A is a mutable fused operand that directly updates the matrix blocks. The reduce function looks like, e.g., reduceFunc = (a, b) => a *= b, retaining the modified matrix block a [1]. Yes, the block manager then slaps a new RDD id on the blocks once the fused sequence is finished, but they are not going anywhere, and the operand is de facto mutable. Are you sure you are familiar with the basics of these engines?

[1] https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/blas/AewB.scala
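To be concrete, the fused operator in [1] boils down to roughly this shape (a simplified sketch with illustrative names, not the actual AewB code; it assumes co-partitioned matrix blocks keyed by Int):

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark of this vintage)
    import org.apache.spark.rdd.RDD

    // Elementwise A * B over matrix blocks: the left block is mutated in
    // place and retained, so nothing is copied or reserialized between steps.
    def ewMult(a: RDD[(Int, Matrix)], b: RDD[(Int, Matrix)]): RDD[(Int, Matrix)] =
      a.join(b).map { case (key, (blockA, blockB)) =>
        blockA *= blockB      // in-place Hadamard product on the block
        key -> blockA         // hand the mutated block right back
      }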
> The proof is in the pudding, I think. The 0xdata team think that they can
> knock out a Mahout matrix and vector data type pretty quickly. They also
> think that the SSVD algorithm will follow from that pretty straightforwardly.

It is not a problem to write the algorithm. As it happens, the algorithm is simple and is already written in our formalisms [2]. The problems are (1) can it be translated via the physical operator layer to yet another engine, and (2) why the heck do we need a new engine as part of the project at all? Why not include MapReduce as well; after all, the majority of our solvers are written specifically for it? And if H2O has already been out in the open for some time, how will embedding it help either H2O or Mahout?

A few(?) months ago, before I started that effort, I specifically talked the philosophy of Mahout-as-translation-layer over with you and got your full support. According to this philosophy, Mahout is not trying to merge in Spark, MapReduce, or whatever other layer, but just to leverage them through a translation layer. Algorithms in this environment are thus not tightly coupled to the backend. There are cost-based optimization techniques and a physical operator set specific to each engine, but there is no change in the logical representation of an algorithm (the P.P.S. below sketches the shape of this seam). This notion of Mahout devouring a distributed engine flies directly in the face of that already-discussed philosophy. The two simply can't coexist.

[2] https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/decompositions/DSSVD.scala

And you still haven't addressed my stickiest question: how are you planning to address Mahout homogeneity with this contribution, assuming it is not a simple rebranding effort?

Here is the current philosophy, as I have been seeing it up to this moment:

(a) Mahout is nothing but a translation layer with respect to backend primitives.

(b) Mahout provides in-core support for matrices and, perhaps, data frames, to run both in front and in back as needed.

(c) Most importantly, Mahout creates a semantically impeccable environment for algorithm developers and decouples them from knowledge (or low-level operation) of the backend. My best current approximation to this, at this point, is again [2].

(d) Such an environment is also algorithmically sound, i.e. it has to be a clean and performant functional programming environment, preferably supporting scripting as well, and not just some sort of domain-specific language such as SQL.

(e) In its linalg aspects it is damn close to R or another existing environment (since we are trying to push new things on the same crowd that is accustomed to R-type things).

(f) And, most importantly, we stop throwing in new algorithms just for the sake of throwing them in. Instead, we enable building them and using them.

So why, for example, doesn't MLI quite fit this vision? Answer: it is tightly coupled with Spark, and it has no coherent in-core/out-of-core linalg support. But MLI goes to show that people are moving along these lines these days (and more such projects are breeding).

Without these steps Mahout will not escape its major criticism: just a library of rigidly built algorithms. Hard to use. Hard to develop on top of. Hard to customize. Hard to validate.

There are a few items to consider as possible developer stories:

(1) The Scala DSL fits all the requirements nicely: no parsers, no semantic trees, a mixed environment of a strong functional language plus DSL capabilities and, if so needed, an interactive shell/script engine (including on-the-fly compilation to bytecode, so not even cost-of-iteration penalties here!). The P.S. below shows what an algorithm looks like in it.

(2) In-core performance (if it is even a concern): the Matrix abstraction can evolve to include JBlas- and GPU-backed data sets. In terms of performance, the latest conference papers on GPU approaches demonstrate that GPU-stored mutable datasets will blow the socks off anything written against the CPU and RAM bus. In fact, reading some of these papers makes me wonder whether linear algebra even has a long-term future in distributed computation at all. It is a week's worth of work to incorporate any of this under the Matrix hood, and GPU nodes can be rented from EC2. That is, in case one thinks there is a performance issue that outweighs the semantic clarity of the algorithm design. Not an issue whatsoever. Certainly not at the cost of the environment.

(3) Every multi-node system (even allreduce) incurs serialized I/O. So yes, *maybe* our matrices could use better compression -- although I am dubious about that if the cost-based switch to sparse algebra is properly applied in the optimizer. So there may be valuable contributions here, but it is not an architecture-changing thing.

(4) A couple of days of work to throw in Stratosphere primitives.

(5) Develop the same for data frames.

(6) Fire off the algorithm developers. In two weeks of dedicated time (assuming they have the time to dedicate) they will be beyond the horizon in the sum of their accomplishments -- accomplishments one can actually read.

Net-net, there are very few good things in this merger for the existing vision as discussed. The biggest one is, of course, fighting the generally anemic state of the project with a side investment -- at the cost of the vision.

Also, I support all of Sebastian's questions, and I am dubious that you have provided good answers to any of them. I am dubious on project homogeneity. I am dubious on the physical operator set on offer. I am dubious on compatibility with any of the existing code. Finally, I am dubious on the general philosophy. Until all of this is spelled out, it is hard for me to cast any vote on this. As it stands, it is -1 overall.

Driven by pragmatic considerations, if outweighed here, I should probably try to take this philosophy elsewhere, in the hope of finding more closely aligned pragmatic interests.

-d
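
P.S. To make the "already written in formalisms" point concrete: the heart of [2] reads roughly like this in the DSL (a simplified sketch from memory -- no power iterations, bound checks and the final truncation to k columns elided; see the file itself for the real thing; assumes the usual scalabindings/drm imports and a distributed context):

    // Stochastic SVD skeleton over a distributed row matrix drmA (m x n),
    // rank k, oversampling p; mxOmega is a random n x (k + p) in-core matrix
    // (the real code generates Omega blockwise from a seed instead).
    val drmY  = drmA %*% mxOmega                    // Y = A Omega
    val drmQ  = dqrThin(drmY)._1                    // distributed thin QR: Y = Q R
    val drmBt = drmA.t %*% drmQ                     // B' = A' Q, i.e. B = Q' A
    val inCoreBBt = (drmBt.t %*% drmBt).collect     // small (k+p) x (k+p) B B'
    val (uHat, d) = eigen(inCoreBBt)                // in-core eigendecomposition
    val s = d.sqrt                                  // singular values
    val drmU = drmQ %*% uHat                        // U = Q UHat
    val drmV = drmBt %*% (uHat %*% diagv(1 /: s))   // V = B' UHat S^-1

A handful of lines of linear algebra, and not an engine in sight. That is the environment I am arguing for.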

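P.P.S. On the translation-layer point: the seam I keep referring to has roughly this shape (a hypothetical sketch with made-up names, only to illustrate the separation; in the actual codebase it is the DrmLike trait plus per-engine physical operators):

    // Logical side: engine-free operators that only build an expression tree,
    // executing nothing. Algorithms are written entirely against this.
    trait DistMatrix {
      def t: DistMatrix                       // logical transpose
      def %*%(that: DistMatrix): DistMatrix   // logical product
    }

    // Physical side: the only engine-specific code. Each backend (Spark,
    // Stratosphere, ...) rewrites the logical tree into its own operators,
    // driven by cost-based choices (dense vs. sparse, fusion, partitioning).
    trait Engine {
      def optimize(logicalPlan: DistMatrix): PhysicalOp
    }

    trait PhysicalOp {
      def exec(): DistMatrix   // materialized result, engine-agnostic to the caller
    }

Adding an engine means writing another optimize(), not touching a single algorithm. Bolting an engine into the project anywhere other than behind this seam is exactly what cannot coexist with the vision.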