Ted,
> This isn't going to fix what is wrong with numerical computing in Mahout
> because the problems are different. (To my mind, the key problems for
> numerical computing include:
>
> a) efficient, very fine-grained parallelism (think microseconds)
> b) efficient in-memory mutable storage
> c) no serialization of data between steps
>
> These problems are not even addressed by most data-flow architectures (...)

-1. Points (b) and (c) are directly addressed by Spark and Stratosphere. All partitions are mutable, not only between fused operands but also between different pipelines, if you instruct the engine to do so. There's no serialization happening if the physical operator instructs the block manager to keep blocks deserialized (and, as it happens, that's exactly what it instructs by default).

My implementation of, say, elementwise A * B or 5.0 * A is a mutable fused operand that directly updates the matrix blocks. The reduce function looks like, e.g., reduceFunc = (a, b) => a *= b, retaining the modified matrix block a [1]. Yes, the block manager then slaps a new RDD id on the blocks once the fused sequence is finished, but they are not going anywhere, and the operand is de facto mutable. Are you sure you are familiar with the basics of these engines?

[1] https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/blas/AewB.scala
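To be concrete, the fused operator in [1] boils down to roughly this shape (a simplified sketch with illustrative names, not the actual AewB code; it assumes co-partitioned matrix blocks keyed by Int):

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark of this vintage)
    import org.apache.spark.rdd.RDD

    // Elementwise A * B over matrix blocks: the left block is mutated in
    // place and retained, so nothing is copied or reserialized between steps.
    def ewMult(a: RDD[(Int, Matrix)], b: RDD[(Int, Matrix)]): RDD[(Int, Matrix)] =
      a.join(b).map { case (key, (blockA, blockB)) =>
        blockA *= blockB      // in-place Hadamard product on the block
        key -> blockA         // hand the mutated block right back
      }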
> The proof is in the pudding, I think. The 0xdata team think that they can
> knock out a Mahout matrix and vector data type pretty quickly. They also
> think that the SSVD algorithm will follow from that pretty straightforwardly.

It is not a problem to write the algorithm. As it happens, the algorithm is simple and is already written in our formalisms [2]. The problems are (1) can it be translated via the physical operator layer to yet another engine, and (2) why the heck do we need a new engine as part of the project at all? Why not include MapReduce as well; after all, the majority of our solvers are written specifically for it? And if H2O has already been out in the open for some time, how will embedding it help either H2O or Mahout?

A few(?) months ago, before I started that effort, I specifically talked the philosophy of Mahout-as-translation-layer over with you and got your full support. According to this philosophy, Mahout is not trying to merge in Spark, MapReduce, or whatever other layer, but just to leverage them through a translation layer. Algorithms in this environment are thus not tightly coupled to the backend. There are cost-based optimization techniques and a physical operator set specific to each engine, but there is no change in the logical representation of an algorithm (the P.P.S. below sketches the shape of this seam). This notion of Mahout devouring a distributed engine flies directly in the face of that already-discussed philosophy. The two simply can't coexist.

[2] https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/drm/decompositions/DSSVD.scala

And you still haven't addressed my stickiest question: how are you planning to address Mahout homogeneity with this contribution, assuming it is not a simple rebranding effort?

Here is the current philosophy, as I have been seeing it up to this moment:

(a) Mahout is nothing but a translation layer with respect to backend primitives.

(b) Mahout provides in-core support for matrices and, perhaps, data frames, to run both in front and in back as needed.

(c) Most importantly, Mahout creates a semantically impeccable environment for algorithm developers and decouples them from knowledge (or low-level operation) of the backend. My best current approximation to this, at this point, is again [2].

(d) Such an environment is also algorithmically sound, i.e. it has to be a clean and performant functional programming environment, preferably supporting scripting as well, and not just some sort of domain-specific language such as SQL.

(e) In its linalg aspects it is damn close to R or another existing environment (since we are trying to push new things on the same crowd that is accustomed to R-type things).

(f) And, most importantly, we stop throwing in new algorithms just for the sake of throwing them in. Instead, we enable building them and using them.

So why, for example, doesn't MLI quite fit this vision? Answer: it is tightly coupled with Spark, and it has no coherent in-core/out-of-core linalg support. But MLI goes to show that people are moving along these lines these days (and more such projects are breeding).

Without these steps Mahout will not escape its major criticism: just a library of rigidly built algorithms. Hard to use. Hard to develop on top of. Hard to customize. Hard to validate.

There are a few items to consider as possible developer stories:

(1) The Scala DSL fits all the requirements nicely: no parsers, no semantic trees, a mixed environment of a strong functional language plus DSL capabilities and, if so needed, an interactive shell/script engine (including on-the-fly compilation to bytecode, so not even cost-of-iteration penalties here!). The P.S. below shows what an algorithm looks like in it.

(2) In-core performance (if it is even a concern): the Matrix abstraction can evolve to include JBlas- and GPU-backed data sets. In terms of performance, the latest conference papers on GPU approaches demonstrate that GPU-stored mutable datasets will blow the socks off anything written against the CPU and RAM bus. In fact, reading some of these papers makes me wonder whether linear algebra even has a long-term future in distributed computation at all. It is a week's worth of work to incorporate any of this under the Matrix hood, and GPU nodes can be rented from EC2. That is, in case one thinks there is a performance issue that outweighs the semantic clarity of the algorithm design. Not an issue whatsoever. Certainly not at the cost of the environment.

(3) Every multi-node system (even allreduce) incurs serialized I/O. So yes, *maybe* our matrices could use better compression -- although I am dubious about that if the cost-based switch to sparse algebra is properly applied in the optimizer. So there may be valuable contributions here, but it is not an architecture-changing thing.

(4) A couple of days of work to throw in Stratosphere primitives.

(5) Develop the same for data frames.

(6) Fire off the algorithm developers. In two weeks of dedicated time (assuming they have the time to dedicate) they will be beyond the horizon in the sum of their accomplishments -- accomplishments one can actually read.

Net-net, there are very few good things in this merger for the existing vision as discussed. The biggest one is, of course, fighting the generally anemic state of the project with a side investment -- at the cost of the vision.

Also, I support all of Sebastian's questions, and I am dubious that you have provided good answers to any of them. I am dubious on project homogeneity. I am dubious on the physical operator set on offer. I am dubious on compatibility with any of the existing code. Finally, I am dubious on the general philosophy. Until all of this is spelled out, it is hard for me to cast any vote on this. As it stands, it is -1 overall.

Driven by pragmatic considerations, if outweighed here, I should probably try to take this philosophy elsewhere, in the hope of finding more closely aligned pragmatic interests.

-d
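
P.S. To make the "already written in formalisms" point concrete: the heart of [2] reads roughly like this in the DSL (a simplified sketch from memory -- no power iterations, bound checks and the final truncation to k columns elided; see the file itself for the real thing; assumes the usual scalabindings/drm imports and a distributed context):

    // Stochastic SVD skeleton over a distributed row matrix drmA (m x n),
    // rank k, oversampling p; mxOmega is a random n x (k + p) in-core matrix
    // (the real code generates Omega blockwise from a seed instead).
    val drmY  = drmA %*% mxOmega                    // Y = A Omega
    val drmQ  = dqrThin(drmY)._1                    // distributed thin QR: Y = Q R
    val drmBt = drmA.t %*% drmQ                     // B' = A' Q, i.e. B = Q' A
    val inCoreBBt = (drmBt.t %*% drmBt).collect     // small (k+p) x (k+p) B B'
    val (uHat, d) = eigen(inCoreBBt)                // in-core eigendecomposition
    val s = d.sqrt                                  // singular values
    val drmU = drmQ %*% uHat                        // U = Q UHat
    val drmV = drmBt %*% (uHat %*% diagv(1 /: s))   // V = B' UHat S^-1

A handful of lines of linear algebra, and not an engine in sight. That is the environment I am arguing for.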

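P.P.S. On the translation-layer point: the seam I keep referring to has roughly this shape (a hypothetical sketch with made-up names, only to illustrate the separation; in the actual codebase it is the DrmLike trait plus per-engine physical operators):

    // Logical side: engine-free operators that only build an expression tree,
    // executing nothing. Algorithms are written entirely against this.
    trait DistMatrix {
      def t: DistMatrix                       // logical transpose
      def %*%(that: DistMatrix): DistMatrix   // logical product
    }

    // Physical side: the only engine-specific code. Each backend (Spark,
    // Stratosphere, ...) rewrites the logical tree into its own operators,
    // driven by cost-based choices (dense vs. sparse, fusion, partitioning).
    trait Engine {
      def optimize(logicalPlan: DistMatrix): PhysicalOp
    }

    trait PhysicalOp {
      def exec(): DistMatrix   // materialized result, engine-agnostic to the caller
    }

Adding an engine means writing another optimize(), not touching a single algorithm. Bolting an engine into the project anywhere other than behind this seam is exactly what cannot coexist with the vision.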