Re: Gokhan's PR post: here are my thoughts, but I did not want to post them there, since they go beyond the scope of that PR's work to chase the root of the issue.
On quasi-algebraic methods
==========================

What is the dilemma here? I don't see any. I already explained that no more than 25% of algorithms are truly 100% algebraic, but about 80% cannot avoid using some algebra, and close to 95% could benefit from using algebra (even stochastic and Monte Carlo stuff). So we are building a system that cuts a developer's work by at least 60% and makes that work more readable by 3000%. As far as I am concerned, that fulfills the goal.

And I am perfectly happy writing a mix of engine-specific primitives and algebra. That's why I am a bit skeptical about attempts to abstract non-algebraic primitives, such as the row-wise aggregators in one of the pull requests. Engine-specific primitives and algebra can perfectly coexist in the guts. And that's how I do my own work in practice, except I can now skip 80% of the effort on algebra and on bridging incompatible inputs and outputs.

None of that means that R-like algebra cannot be engine-agnostic. So people are unhappy about not being able to write the whole thing in a totally agnostic way? And from that they (falsely) infer that the individual pieces of their work cannot be helped by agnosticism, or that the tools are not as good as they might be without backend agnosticism? Sorry, but I fail to see the logic there. We proved algebra can be agnostic; I don't think this notion should be disputed. And even if there were a shred of real benefit in making the algebra tools non-agnostic, it would never outweigh the tons of good we could get for the project by integrating with, e.g., the Flink folks. This is one of the points MLLib will never be able to overcome -- being a truly shared ML platform where people can create and share ML, not just a bunch of ad-hoc spaghetti of distributed API calls and Spark-nailed black boxes.

Well, yes, methodology implementations will still have native distributed calls.
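To make the coexistence point concrete, here is a minimal plain-Scala sketch -- all names are hypothetical, not actual Mahout API, and an in-memory Seq stands in for a real distributed backend. Most of a pipeline is backend-agnostic matrix algebra; one step drops down to an engine-specific row-wise primitive; the two compose in the same pipeline.

```scala
// Hypothetical sketch: a tiny "distributed" row matrix backed by an
// in-memory collection, standing in for a real backend (Spark, Flink, ...).
case class Drm(rows: Vector[Vector[Double]]) {
  val ncol: Int = if (rows.isEmpty) 0 else rows.head.length

  // Backend-agnostic algebra: transpose-times-self, i.e. A' * A.
  // A real engine would implement this as a distributed operator.
  def tSelf: Vector[Vector[Double]] =
    Vector.tabulate(ncol, ncol) { (i, j) =>
      rows.map(r => r(i) * r(j)).sum
    }
}

// Engine-specific primitive: a row-wise aggregation that is NOT algebra.
// Each engine would express this with its own API (mapPartitions, reduce, ...).
def rowSumsEngineSpecific(drm: Drm): Vector[Double] =
  drm.rows.map(_.sum)

// The two coexist in one pipeline:
val a    = Drm(Vector(Vector(1.0, 2.0), Vector(3.0, 4.0)))
val gram = a.tSelf                    // algebraic step
val sums = rowSumsEngineSpecific(a)   // engine-specific step
```

The point of the sketch is the shape of the code, not the arithmetic: the algebraic part carries no engine dependency, while the aggregator is free to be as engine-nailed as it needs to be.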
Just not nearly as many as they otherwise would, and they will be much easier to support on another backend using the Strategy pattern. E.g., the implicit-feedback problem that I originally wrote as a quasi-method for Spark only would have taken just an hour or so to grow a Flink strategy, since it retains all the in-core and distributed algebra work as-is. Not to mention the benefit of single-type pipelining. And once we add hardware-accelerated bindings for the in-core stuff, all these methods would immediately benefit from them.

On MLLib interoperability issues
================================

Well, let me ask you this: what does it mean to be MLLib-interoperable? Is MLLib even interoperable within itself? E.g., I remember one of the most frequent requests on this list: how can we cluster dimensionally-reduced data? Let's look at what it takes to do this in MLLib. First, we run tf-idf, which produces a collection of vectors (and where did our document ids go? not sure). Then we'd have to run SVD or PCA, both of which accept a RowMatrix (bummer! but we have a collection of vectors), and which produce a RowMatrix as well -- but k-means training takes an RDD of vectors (bummer again!). Not directly pluggable, although semi-trivially or trivially convertible. Plus each conversion strips off information that we have potentially already computed earlier in the pipeline, so we'd need to compute it again. I think the problem is well demonstrated.

Or, say, the ALS stuff (implicit ALS in particular) is really an algebraic problem. It should take input in the form of matrices (which my algebraic feature-extraction pipeline has perhaps just prepared), but it actually takes POJOs. Bummer again. So what exactly should we be interoperable with in this picture, if MLLib itself is not consistent?
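The "add a strategy for Flink in an hour" claim above boils down to the classic Strategy pattern. Here is a hedged plain-Scala sketch (hypothetical names, not the actual Mahout engine SPI): only the non-algebraic, engine-specific operations go behind an engine trait; the method itself is written once, and supporting a new backend means adding one strategy object, not rewriting the method.

```scala
// Hypothetical engine SPI: only engine-specific primitives live here.
trait DistributedEngine {
  def name: String
  // e.g. a distributed element-wise scaling primitive
  def scale(data: Seq[Double], factor: Double): Seq[Double]
}

// One strategy per backend; in reality these would call Spark/Flink APIs.
object SparkLikeEngine extends DistributedEngine {
  val name = "spark"
  def scale(data: Seq[Double], factor: Double): Seq[Double] = data.map(_ * factor)
}
object FlinkLikeEngine extends DistributedEngine {
  val name = "flink"
  def scale(data: Seq[Double], factor: Double): Seq[Double] = data.map(_ * factor)
}

// The quasi-algebraic method is written once; all the shared
// (algebraic) work is retained as-is across backends.
def normalizeToMax(data: Seq[Double], engine: DistributedEngine): Seq[Double] = {
  val m = data.max                 // shared, engine-agnostic part
  engine.scale(data, 1.0 / m)     // engine-specific part via strategy
}

val data    = Seq(2.0, 4.0)
val onSpark = normalizeToMax(data, SparkLikeEngine)
val onFlink = normalizeToMax(data, FlinkLikeEngine)
```

Switching backends touches one argument; the method body -- the part that encodes the actual methodology -- is untouched.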
Let's look at the type system in flux there. We have: (1) a collection of vectors; (2) a matrix of known dimensions over a collection of vectors (RowMatrix); (3) IndexedRowMatrix, a matrix of known dimensions whose keys can be _only_ longs; and (4) an unknown but not infinitesimal number of POJO-oriented approaches.

But OK, let's constrain ourselves to matrix types only. The multitude of matrix types creates problems for tasks that require consistent key propagation (like SVD, PCA or tf-idf, as well demonstrated in the MLLib case above). In the aforementioned case of dimensionality reduction over a document collection, there's simply no way to propagate document ids to the rows of the dimensionally-reduced data. As in none at all. As in a hard, no-workaround-exists stop.

So: there's truly no need for multiple incompatible matrix types. There has to be just a single matrix type -- a flexible one -- and everything algebraic needs to use it. If geometry is needed, it could be either already known or lazily computed; if it is not needed, nobody bothers to compute it. And that knowledge should not be lost just because we have to convert between types. If we want to express complex row keys, such as cluster assignments for example (my real case), then we could have a type with keys like Tuple2(rowKeyType, cluster-string). And nobody really cares whether intermediate results are row- or column-partitioned. All within a single type of things.

Bottom line: "interoperability" with MLLib is both hard and trivial. Trivial, because whenever you need to convert, it is one line of code and also a trivial distributed-map fusion element (I do have pipelines streaming MLLib methods within DRM-based pipelines, not just speculating). Hard, because there are so many types you may need or want to convert between that there's not much point in even trying to write converters for all possible cases; better to go on a need-to-do basis.
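The "single flexible matrix type" argument can be sketched in a few lines of plain Scala (hypothetical type, not the actual DRM implementation): rows are keyed by an arbitrary type K, keys survive every row-wise algebraic step, complex keys like (docId, cluster) are just another K rather than a new matrix type, and dropping to a keyless collection-of-vectors world is a one-line map.

```scala
// Hypothetical single matrix type: rows keyed by an arbitrary type K.
case class KeyedMatrix[K](rows: Seq[(K, Vector[Double])]) {

  // Any row-wise transform propagates keys automatically; here a
  // projection onto the first columns stands in for dimensionality
  // reduction (the step where MLLib loses the document ids).
  def mapRows(f: Vector[Double] => Vector[Double]): KeyedMatrix[K] =
    KeyedMatrix(rows.map { case (k, v) => (k, f(v)) })

  // Re-keying, e.g. attaching a cluster label: a Tuple2 key is just
  // another K -- no new incompatible matrix type needed.
  def mapKeys[K2](f: (K, Vector[Double]) => K2): KeyedMatrix[K2] =
    KeyedMatrix(rows.map { case (k, v) => (f(k, v), v) })

  // "Interop" with a keyless collection-of-vectors world: one map.
  def toVectors: Seq[Vector[Double]] = rows.map(_._2)
}

// Document ids survive a reduction step...
val docs = KeyedMatrix(Seq("doc1" -> Vector(1.0, 2.0, 3.0),
                           "doc2" -> Vector(4.0, 5.0, 6.0)))
val reduced = docs.mapRows(_.take(2))

// ...and a cluster assignment becomes a (docId, cluster) key.
val clustered = reduced.mapKeys((id, v) => (id, if (v(0) < 2) "c0" else "c1"))
```

Nothing here is lost between steps: the conversion to the keyless form exists where it is needed, but it is a deliberate one-liner at the edge of the pipeline, not a forced information drop in the middle of it.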
It is also hard because their type system obviously continues to evolve as we speak, so there's no point chasing a rabbit still in the making.

Epilogue
========

There's no problem with the philosophy of the distributed and non-distributed algebra approach. It is incredibly useful in practice, and I have proven it continuously (what is in the public domain is just the tip of the iceberg). Rather, there's organizational anemia in the project: corporate legal interests (which include me not being able to do a quick turnaround of fixes), and not having been able to tap into university resources. But I don't believe in any technical philosophy problem. So given the aforementioned resource/logistical anemia, it will likely seem to get worse for some time before it gets better. But AFAIK there are multiple efforts going on behind the curtains to break the red tape, so I'd just wait a bit.