On Thu, Mar 13, 2014 at 1:09 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > with numerical computing in Mahout because the problems are different.
> > (To my mind, the key problems for numerical computing include:
> >
> > a) efficient, very fine-grained parallelism (think microseconds)
> >
> > b) efficient in-memory mutable storage
> >
> > c) no serialization of data between steps
> >
> > These problems are not even addressed by most data-flow architectures
> > (...)
>
> -1. b) and c) are directly addressed by Spark and Stratosphere. All
> partitions are mutable, not only between fused operands but also between
> different pipelines if you instruct it to do so. There's no
> deserialization happening if the physical operator instructs the block
> manager accordingly (and as it happens, that's exactly what it instructs
> it to do by default). My implementation of, say, elementwise A*B or
> 5.0 * A is a mutable fused operand that directly updates matrix blocks.
> The reduce function looks like e.g. reduceFunc = (a, b) => a *= b here
> (retaining the modified a matrix block).[1] Yes, the block manager then
> slaps the blocks with a new RDD id once the fused sequence is finished,
> but they are not going anywhere, and the operand is de facto mutable.
> Are you sure you are familiar with the basics of these engines?

I am actually pretty sure that I am not as familiar as I need to be. At the
same time, I am pretty sure that there is no direct support for fine-grained
parallelism of the sort that h2o supports, and I am pretty sure that there
is no current code for keeping compressed forms of matrices with efficiency
comparable to the h2o code. The fine-grained parallelism in h2o is done by
capitalizing on the inherent capabilities of the JVM and by supporting a
fork/join style which (insofar as I know) is fairly different from what
Spark does.
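
By fork/join style I mean roughly the kind of decomposition below. This is
only a sketch using the plain java.util.concurrent fork/join framework;
the class and names are made up for illustration and are not h2o's actual
task classes:

    import java.util.concurrent.{ForkJoinPool, RecursiveTask}

    // Recursive fork/join sum over a slice of an array: split until the slice
    // is small enough, then run a tight loop over it. Purely illustrative,
    // not h2o code.
    class ChunkSum(xs: Array[Double], lo: Int, hi: Int) extends RecursiveTask[Double] {
      override def compute(): Double =
        if (hi - lo <= 4096) {
          var s = 0.0
          var i = lo
          while (i < hi) { s += xs(i); i += 1 }
          s
        } else {
          val mid = (lo + hi) >>> 1
          val left = new ChunkSum(xs, lo, mid)
          left.fork()                               // left half runs asynchronously
          val right = new ChunkSum(xs, mid, hi).compute()
          right + left.join()
        }
    }

    object ChunkSumExample {
      def main(args: Array[String]): Unit = {
        val data = Array.tabulate(1 << 20)(_.toDouble)
        println(new ForkJoinPool().invoke(new ChunkSum(data, 0, data.length)))
      }
    }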

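And for comparison, here is how I read the fused elementwise operand
described in the quoted text, sketched against the plain Spark RDD API. The
Block type and function names below are made up for illustration; they are
not the actual Mahout Spark bindings:

    import org.apache.spark.SparkContext._   // pair RDD functions
    import org.apache.spark.rdd.RDD

    object FusedElementwiseSketch {
      // Hypothetical block representation: a dense array standing in for a
      // real matrix block, keyed by block index.
      type Block = Array[Double]

      // Elementwise A * B as a fused, in-place operand: the left block is
      // mutated and returned, in the spirit of reduceFunc = (a, b) => a *= b,
      // so no new block storage is allocated for the result.
      def elementwiseTimes(a: RDD[(Int, Block)], b: RDD[(Int, Block)]): RDD[(Int, Block)] =
        a.join(b).mapValues { case (ab, bb) =>
          var i = 0
          while (i < ab.length) { ab(i) *= bb(i); i += 1 }
          ab
        }

      // 5.0 * A, again updating A's blocks directly rather than copying them.
      def scalarTimes(s: Double, a: RDD[(Int, Block)]): RDD[(Int, Block)] =
        a.mapValues { blk =>
          var i = 0
          while (i < blk.length) { blk(i) *= s; i += 1 }
          blk
        }
    }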