On Thu, Mar 13, 2014 at 1:09 PM, Dmitriy Lyubimov <[email protected]> wrote:

> I guess the current philosophy, as I have been seeing it up to this moment:
>
> (a) Mahout is nothing but a translation layer with respect to backend
> primitives.
>

Well, there ought to be some implementation above this translation layer.
 We probably agree on that point.


>  (b) Mahout provides in-core support for matrices, and perhaps data
> frames, to run both on the front end and the back end as needed.
>

Sounds good.  And in accord with the 0xdata proposal.
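
For concreteness, here is a minimal sketch of what R-like in-core matrix
work could look like with the Scala bindings (I am going from the bindings
referenced as [2]; the exact helper names may differ):

    // In-core only; nothing here touches a distributed backend.
    import org.apache.mahout.math._
    import scalabindings._
    import RLikeOps._

    val a = dense((1, 2, 3), (3, 4, 5))  // 2 x 3 dense matrix from row tuples
    val ata = a.t %*% a                  // A'A with R-style transpose/multiply
    val aScaled = a * 2.0                // element-wise scaling

Something like this covers the "in front" side; the same expressions should
be writable against distributed matrices on the back end.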


>  (c) Most importantly, Mahout creates a semantically impeccable environment
> for algorithm developers and decouples them from the knowledge (or
> low-level operation) of the backend. My best approximation to this
> at this point is, again, [2].
>

This sounds like a good idea.


>  (d) Such an environment is also algorithmically sound (i.e. it has to be a
> clean and performant functional programming environment, preferably
> supporting scripting as well, but not just some sort of domain-specific
> language such as SQL).
>

The first sentence is good.  I am not clear that a full-scale functional
programming environment is necessary in order to support linear algebra.  I
agree that SQL isn't going to help us much.


> (e) In its linalg aspects it is damn close to R or an existing environment
> (since we are trying to push new things on the same crowd accustomed to
> R-type things).
>

I think that a corollary here is that there should be some buy-in from the
existing community outside of Mahout.  Initial positive reception by this
community would be an indication that this is working.



> (f) and, most importantly, stop throwing in new algorithms just for the
> sake of throwing them in. Instead, enable building them and using them.
>

This is a fine thing.


>
> So, why doesn't, e.g., MLI quite fit this vision?
>

The biggest problem here is that the MLI community isn't offering to join
forces with us to help out.


> a: Tightly coupled with
> Spark. No coherent in-core/out-of-core linalg support.
>

Ah... well that does seem to be a sticking point as well.


> But MLI goes to show that people are going along these lines these days
> (there are more projects breeding). Without these steps Mahout will not
> escape its major criticism: just a library of rigidly built algorithms.
> Hard to use. Hard to develop on top of. Hard to customize. Hard to
> validate.
>

Without which steps?  I agree all the way down to here, but then I lose your
thread.


>
> There are a few items to consider as possible developer's stories.
>
> (1) A Scala DSL fits all the requirements nicely. No parsers, no semantic
> trees, a mixed environment of a strong functional language and DSL
> capabilities and, if so needed, an interactive shell/script engine
> (including on-the-fly compilation of classes into byte code, so there are
> not even cost-of-iteration penalties here!)
>

This sounds great.  The only thing lacking (right now) is the ability to do
byte code transformations to get really high performance.
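
To make that concrete, the algorithm-facing code against the distributed
bindings would look roughly like this (a sketch based on my reading of [2];
I am assuming the Spark-backed DRM API with drmParallelize, the R-like
operators, and collect, and that a Mahout/Spark distributed context is
already in scope):

    import org.apache.mahout.math._
    import scalabindings._
    import RLikeOps._
    import org.apache.mahout.sparkbindings._
    import drm._
    import RLikeDrmOps._

    // The expression builds a logical plan; the optimizer picks physical
    // operators, and nothing executes until an action such as collect.
    val a = dense((1, 2), (3, 4))                    // small in-core matrix
    val drmA = drmParallelize(a, numPartitions = 2)  // ship it out as a DRM
    val drmAtA = drmA.t %*% drmA                     // thin A'A, backend-agnostic
    val inCoreAtA = drmAtA.collect                   // small result back in-core

The point Dmitriy is making is that none of this code mentions RDDs or any
other backend primitive directly.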


>
> (2) In-core performance (if it is even a concern). The matrix abstraction
> can evolve to include JBlas- and GPU-based data sets. In terms of
> performance, the latest conference papers on the GPU approach demonstrate
> that GPU-resident mutable datasets will blow the socks off anything
> written against the CPU and RAM bus.
>

GPU's have been hobbled until recently by poor bandwidth to main memory.
For problems that can fit inside GPU memory (possibly multiplied by
parallelism), performance can be very good.

Nothing prevents a system like h2o from taking advantage of GPU's.  It
already matches and often exceeds the speed of JBlas.  Having a persistent
parallel representation for GPU resident data sets would be very
interesting for appropriate problems.

GPU's, however, show very different characteristics with highly sparse
problems.  There, the amount of arithmetic per memory operation is
dramatically lower than in problems like deep learning, and so GPU's
provide considerably less advantage.
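
As a rough back-of-the-envelope illustration of that last point (my own
numbers, not from any benchmark in this thread): a dense n x n multiply
performs about 2n^3 flops while moving on the order of 3n^2 values, so the
arithmetic per byte grows with n and a GPU can stay busy; a sparse
matrix-vector multiply performs roughly 2 flops per stored nonzero (plus
index traffic), so it stays memory-bandwidth bound no matter how large the
problem gets.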



> ...
>
> (3) Every multinode system (even allreduce) incurs serialized I/O. So yes,
> *maybe* our matrices could use better compression -- although I am dubious
> about that if the cost-based switch to sparse algebra is properly applied
> in the optimizer. So there may be valuable contributions here, but it is
> not an architecture-changing thing.
>

I am not sure what the point is here.


>
> (4) A couple of days of work to throw in Stratosphere primitives.
>

Likewise.  If the Stratosphere community would like to step up to help with
this, I would champion that contribution as well.
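
For what it's worth, the reason a new backend can plausibly be days rather
than months of work is that a binding only has to supply the small set of
physical operators the optimizer emits. A purely hypothetical sketch of
that shape (these trait and method names are mine, not actual Mahout API):

    import org.apache.mahout.math.Matrix

    // Hypothetical backend contract; illustrative only.
    trait DistributedBackend[Drm] {
      def drmFromHdfs(path: String): Drm   // load a distributed row matrix
      def atA(a: Drm): Drm                 // physical operator for A'A
      def aTimesB(a: Drm, b: Drm): Drm     // physical operator for A %*% B
      def collect(a: Drm): Matrix          // bring a small result in-core
    }

A Stratosphere binding would implement such a contract with its own data
sets and operators; algorithm code written against the DSL would not need
to change.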


> (5) Develop the same for data frames.
>
> (6) Fire off the algorithm developers. In two weeks of dedicated time
> (assuming they have the time to dedicate) they will be beyond the horizon
> in the sum of their accomplishments -- which one can actually read.
>

I don't see where we disagree.


> Net remainder: there are very few good things in this merger for the
> existing vision as discussed. The biggest one is, of course, fighting the
> generally anemic state of the project with a side investment at the cost
> of the vision.
>

I think that this is a straw man argument.  I am not proposing a full-scale
merger at this point.  I am proposing that 0xdata be encouraged to
contribute a Mahout binding and implementation for a number of key
algorithms.


>
> Also, I support all of Sebastian's questions. I am dubious that you
> provided good answers to any of them.
>

I am working on providing those answers.


> I am dubious on project homogeneity.
>

Not sure why.  The proposal is intended to push forward with existing API's.


> I am dubious
> on the offered physical operator set.
>

Dubious in that you don't think they are sufficient?  Or what?



> I am dubious on compatibility with any of
> the existing code.
>

There is an offer on the table to implement that compatibility.
