Github user MLnick commented on the pull request:
https://github.com/apache/incubator-spark/pull/575#issuecomment-35015689
I guess ideally we should discuss this on the JIRA, but since most of the
discussion seems to be here, I'll comment here.
I've chimed in before in the original PR - at the time I advocated for
Breeze or mahout-math (potentially with the new Scala DSL).
I still advocate for Breeze on the basis that it's a Scala project in the
data / ML community, and I believe it's a great base project that the Scala
community should support. Yes, it may need some bug fixes and performance
enhancements, but with netlib-java at its core it should be capable of good
performance. If effort is going to be put into Spark sparse matrix code, why
not have everyone involved put that effort into improving Breeze and making
it Scala's numpy? As evidenced above, there may well be some easy performance
wins to be had, and I'm sure @dlwh will help cut a release when Spark wants
to merge.
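For reference, here is a minimal sketch of the kind of dense/sparse operations
under discussion, written against Breeze (operator spellings and exact
behaviour vary with the Breeze version, and BLAS dispatch depends on
netlib-java natives being available):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, SparseVector}

object BreezeSketch extends App {
  // Dense vectors and a 2x3 dense matrix
  val x = DenseVector(1.0, 2.0, 3.0)
  val y = DenseVector(4.0, 5.0, 6.0)
  val A = DenseMatrix((1.0, 0.0, 2.0),
                      (0.0, 3.0, 0.0))

  val d  = x dot y    // dot product
  val ew = x *:* y    // element-wise product (spelled :* in older Breeze releases)
  val Ax = A * x      // mat-vec, dispatched to netlib-java BLAS where available

  // Sparse vector: only the non-zero entries are stored
  val s = SparseVector.zeros[Double](3)
  s(0) = 1.0
  s(2) = 5.0
  val sd = s dot x

  println(s"dot=$d elementwise=$ew Ax=$Ax sparseDot=$sd")
}
```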
I again raise the point that maintaining a lot of (any?) matrix algebra
code should not be a goal for Spark MLlib. While mahout-math is a highly
commendable project, I think the maintenance burden such a library would
place on the project and its committers is clear.
If Breeze is not an option because of implicit-related slowness etc., and
mahout-math is not an option for whatever reason, then how about MTJ
(https://github.com/fommil/matrix-toolkits-java)? It has been updated a lot
recently, is based on netlib-java with native support, and outperforms JBLAS.
Its pure-Java performance is decent and it could be wrapped in a lightweight
DSL (like Dmitriy's mahout-math effort). The API is decent, and it has sparse
matrix and vector support.
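To make the "lightweight DSL" idea concrete, here is a rough sketch of what a
thin Scala wrapper over MTJ could look like. The operator names and the
wrapper itself are purely illustrative, not a proposed API; the underlying
calls (Vector.dot, Vector.add, Matrix.mult) are from MTJ's
no.uib.cipr.matrix package:

```scala
import no.uib.cipr.matrix.{DenseMatrix, DenseVector, Vector => MTJVector}
import no.uib.cipr.matrix.sparse.SparseVector

// Illustrative operator sugar over MTJ's Java API -- not a proposed design
object MTJDsl {
  implicit class RichVector(v: MTJVector) {
    // non-mutating addition (MTJ's add mutates the receiver in place)
    def +(other: MTJVector): MTJVector = v.copy().add(other)
  }
  implicit class RichMatrix(m: DenseMatrix) {
    def *(x: MTJVector): MTJVector = {
      val y = new DenseVector(m.numRows())
      m.mult(x, y)  // y := A * x, BLAS-backed via netlib-java when natives are present
      y
    }
  }
}

object MTJDslExample extends App {
  import MTJDsl._

  val x = new DenseVector(Array(1.0, 2.0, 3.0))
  val A = new DenseMatrix(Array(Array(1.0, 0.0, 2.0), Array(0.0, 3.0, 0.0)))

  val s = new SparseVector(3)  // sparse vector of logical length 3
  s.set(0, 1.0)
  s.set(2, 5.0)

  println(s"sparse dot dense = ${s.dot(x)}")  // dot is already on MTJ's Vector interface
  println(s"x + x            = ${x + x}")
  println(s"A * x            = ${A * x}")
}
```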
For distributed ML, all that is really needed 90% of the time is dense
vectors, dense matrices and sparse vectors (with dot products, element-wise
operations, solvers for things like ALS, and possibly the odd matrix
multiply). It is true that the missing sparse features are fairly
lightweight, so ultimately it doesn't much matter which library is chosen, as
long as it's consistent, reasonably performant, and its APIs are as clean and
simple as possible.
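Just to illustrate how small that surface area actually is, something along
these lines (names entirely hypothetical, not a design proposal) would cover
most of what the current algorithms need, whichever library ends up backing
it:

```scala
// Hypothetical minimal linear algebra surface for MLlib-style algorithms;
// names and signatures are illustrative only.
trait MLVector {
  def size: Int
  def apply(i: Int): Double
  def dot(other: MLVector): Double
  def timesElementwise(other: MLVector): MLVector
}

trait MLMatrix {
  def numRows: Int
  def numCols: Int
  def multiply(x: MLVector): MLVector   // the occasional dense mat-vec step
}

// ALS-style algorithms mainly need small dense positive-definite solves
trait Solver {
  def solvePositiveDefinite(a: MLMatrix, b: MLVector): MLVector
}
```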