Re: Gokhan's PR post: here are my thoughts, but I did not want to post them there, since they go beyond the scope of that PR's work to chase the root of the issue.
On quasi-algebraic methods
==========================

What is the dilemma here? I don't see any. I already explained that no more than 25% of algorithms are truly 100% algebraic, but about 80% cannot avoid using some algebra, and close to 95% could benefit from using algebra (even stochastic and Monte Carlo stuff). So we are building a system that cuts a developer's work by at least 60% and makes that work more readable by 3000%. As far as I am concerned, that fulfills the goal.

And I am perfectly happy writing a mix of engine-specific primitives and algebra. That's why I am a bit skeptical about attempts to abstract non-algebraic primitives, such as the row-wise aggregators in one of the pull requests. Engine-specific primitives and algebra can perfectly coexist in the guts. And that's how I do my own work in practice, except I can now skip 80% of the effort on algebra and on bridging incompatible inputs and outputs.

None of that means that R-like algebra cannot be engine-agnostic. So people are unhappy about not being able to write the whole thing in a totally agnostic way? And from that they (falsely) infer that the individual pieces of their work cannot be helped by agnosticism, or that the tools are not as good as they might be without backend agnosticism? Sorry, but I fail to see the logic there. We proved algebra can be agnostic; I don't think this notion should be disputed. And even if there were a shred of real benefit in making the algebra tools non-agnostic, it would never outweigh the tons of good we could get for the project by integrating with, e.g., the Flink folks. This is one of the points MLLib will never be able to overcome -- being a truly shared ML platform where people can create and share ML, not just a bunch of ad-hoc spaghetti of distributed API calls and Spark-nailed black boxes.

Well, yes, methodology implementations will still have native distributed calls.
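To make the coexistence point concrete, here is a minimal plain-Scala sketch -- all names are hypothetical, not actual Mahout API, and an in-memory Seq stands in for a real distributed backend. Most of a pipeline is backend-agnostic matrix algebra; one step drops down to an engine-specific row-wise primitive; the two compose in the same pipeline.

```scala
// Hypothetical sketch: a tiny "distributed" row matrix backed by an
// in-memory collection, standing in for a real backend (Spark, Flink, ...).
case class Drm(rows: Vector[Vector[Double]]) {
  val ncol: Int = if (rows.isEmpty) 0 else rows.head.length

  // Backend-agnostic algebra: transpose-times-self, i.e. A' * A.
  // A real engine would implement this as a distributed operator.
  def tSelf: Vector[Vector[Double]] =
    Vector.tabulate(ncol, ncol) { (i, j) =>
      rows.map(r => r(i) * r(j)).sum
    }
}

// Engine-specific primitive: a row-wise aggregation that is NOT algebra.
// Each engine would express this with its own API (mapPartitions, reduce, ...).
def rowSumsEngineSpecific(drm: Drm): Vector[Double] =
  drm.rows.map(_.sum)

// The two coexist in one pipeline:
val a    = Drm(Vector(Vector(1.0, 2.0), Vector(3.0, 4.0)))
val gram = a.tSelf                    // algebraic step
val sums = rowSumsEngineSpecific(a)   // engine-specific step
```

The point of the sketch is the shape of the code, not the arithmetic: the algebraic part carries no engine dependency, while the aggregator is free to be as engine-nailed as it needs to be.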
Just not nearly as many as they otherwise would, and they will be much easier to support on another backend using the Strategy pattern. E.g., the implicit-feedback problem that I originally wrote as a quasi-method for Spark only would have taken just an hour or so to grow a Flink strategy, since it retains all the in-core and distributed algebra work as-is. Not to mention the benefit of single-type pipelining. And once we add hardware-accelerated bindings for the in-core stuff, all these methods would immediately benefit from them.

On MLLib interoperability issues
================================

Well, let me ask you this: what does it mean to be MLLib-interoperable? Is MLLib even interoperable within itself? E.g., I remember one of the most frequent requests on this list: how can we cluster dimensionally-reduced data? Let's look at what it takes to do this in MLLib. First, we run tf-idf, which produces a collection of vectors (and where did our document ids go? not sure). Then we'd have to run SVD or PCA, both of which accept a RowMatrix (bummer! but we have a collection of vectors), and which produce a RowMatrix as well -- but k-means training takes an RDD of vectors (bummer again!). Not directly pluggable, although semi-trivially or trivially convertible. Plus each conversion strips off information that we have potentially already computed earlier in the pipeline, so we'd need to compute it again. I think the problem is well demonstrated.

Or, say, the ALS stuff (implicit ALS in particular) is really an algebraic problem. It should take input in the form of matrices (which my algebraic feature-extraction pipeline has perhaps just prepared), but it actually takes POJOs. Bummer again. So what exactly should we be interoperable with in this picture, if MLLib itself is not consistent?
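The "add a strategy for Flink in an hour" claim above boils down to the classic Strategy pattern. Here is a hedged plain-Scala sketch (hypothetical names, not the actual Mahout engine SPI): only the non-algebraic, engine-specific operations go behind an engine trait; the method itself is written once, and supporting a new backend means adding one strategy object, not rewriting the method.

```scala
// Hypothetical engine SPI: only engine-specific primitives live here.
trait DistributedEngine {
  def name: String
  // e.g. a distributed element-wise scaling primitive
  def scale(data: Seq[Double], factor: Double): Seq[Double]
}

// One strategy per backend; in reality these would call Spark/Flink APIs.
object SparkLikeEngine extends DistributedEngine {
  val name = "spark"
  def scale(data: Seq[Double], factor: Double): Seq[Double] = data.map(_ * factor)
}
object FlinkLikeEngine extends DistributedEngine {
  val name = "flink"
  def scale(data: Seq[Double], factor: Double): Seq[Double] = data.map(_ * factor)
}

// The quasi-algebraic method is written once; all the shared
// (algebraic) work is retained as-is across backends.
def normalizeToMax(data: Seq[Double], engine: DistributedEngine): Seq[Double] = {
  val m = data.max                 // shared, engine-agnostic part
  engine.scale(data, 1.0 / m)     // engine-specific part via strategy
}

val data    = Seq(2.0, 4.0)
val onSpark = normalizeToMax(data, SparkLikeEngine)
val onFlink = normalizeToMax(data, FlinkLikeEngine)
```

Switching backends touches one argument; the method body -- the part that encodes the actual methodology -- is untouched.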
Let's look at the type system in flux there. We have: (1) a collection of vectors; (2) a matrix of known dimensions over a collection of vectors (RowMatrix); (3) IndexedRowMatrix, a matrix of known dimensions whose keys can be _only_ longs; and (4) an unknown but not infinitesimal number of POJO-oriented approaches.

But OK, let's constrain ourselves to matrix types only. The multitude of matrix types creates problems for tasks that require consistent key propagation (like SVD, PCA or tf-idf, as well demonstrated in the MLLib case above). In the aforementioned case of dimensionality reduction over a document collection, there's simply no way to propagate document ids to the rows of the dimensionally-reduced data. As in none at all. As in a hard, no-workaround-exists stop.

So: there's truly no need for multiple incompatible matrix types. There has to be just a single matrix type -- a flexible one -- and everything algebraic needs to use it. If geometry is needed, it could be either already known or lazily computed; if it is not needed, nobody bothers to compute it. And that knowledge should not be lost just because we have to convert between types. If we want to express complex row keys, such as cluster assignments for example (my real case), then we could have a type with keys like Tuple2(rowKeyType, cluster-string). And nobody really cares whether intermediate results are row- or column-partitioned. All within a single type of things.

Bottom line: "interoperability" with MLLib is both hard and trivial. Trivial, because whenever you need to convert, it is one line of code and also a trivial distributed-map fusion element (I do have pipelines streaming MLLib methods within DRM-based pipelines, not just speculating). Hard, because there are so many types you may need or want to convert between that there's not much point in even trying to write converters for all possible cases; better to go on a need-to-do basis.
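The "single flexible matrix type" argument can be sketched in a few lines of plain Scala (hypothetical type, not the actual DRM implementation): rows are keyed by an arbitrary type K, keys survive every row-wise algebraic step, complex keys like (docId, cluster) are just another K rather than a new matrix type, and dropping to a keyless collection-of-vectors world is a one-line map.

```scala
// Hypothetical single matrix type: rows keyed by an arbitrary type K.
case class KeyedMatrix[K](rows: Seq[(K, Vector[Double])]) {

  // Any row-wise transform propagates keys automatically; here a
  // projection onto the first columns stands in for dimensionality
  // reduction (the step where MLLib loses the document ids).
  def mapRows(f: Vector[Double] => Vector[Double]): KeyedMatrix[K] =
    KeyedMatrix(rows.map { case (k, v) => (k, f(v)) })

  // Re-keying, e.g. attaching a cluster label: a Tuple2 key is just
  // another K -- no new incompatible matrix type needed.
  def mapKeys[K2](f: (K, Vector[Double]) => K2): KeyedMatrix[K2] =
    KeyedMatrix(rows.map { case (k, v) => (f(k, v), v) })

  // "Interop" with a keyless collection-of-vectors world: one map.
  def toVectors: Seq[Vector[Double]] = rows.map(_._2)
}

// Document ids survive a reduction step...
val docs = KeyedMatrix(Seq("doc1" -> Vector(1.0, 2.0, 3.0),
                           "doc2" -> Vector(4.0, 5.0, 6.0)))
val reduced = docs.mapRows(_.take(2))

// ...and a cluster assignment becomes a (docId, cluster) key.
val clustered = reduced.mapKeys((id, v) => (id, if (v(0) < 2) "c0" else "c1"))
```

Nothing here is lost between steps: the conversion to the keyless form exists where it is needed, but it is a deliberate one-liner at the edge of the pipeline, not a forced information drop in the middle of it.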
It is also hard because their type system obviously continues to evolve as we speak, so there's no point chasing a rabbit still in the making.

Epilogue
========

There's no problem with the philosophy of the distributed and non-distributed algebra approach. It is incredibly useful in practice, and I have proven it continuously (what is in the public domain is just the tip of the iceberg). Rather, there's organizational anemia in the project: corporate legal interests (which include me not being able to do a quick turnaround of fixes), and not having been able to tap into university resources. But I don't believe in any technical philosophy problem. So given the aforementioned resource/logistical anemia, it will likely seem to get worse for some time before it gets better. But AFAIK there are multiple efforts going on behind the curtains to break the red tape, so I'd just wait a bit.