Github user MLnick commented on the pull request:
https://github.com/apache/incubator-spark/pull/575#issuecomment-35015689
I guess ideally we should discuss this on the JIRA, but since most of the
discussion seems to be here, I'll comment here.
I've chimed in before in the original PR - at the time I advocated for
Breeze or mahout-math (potentially with the new Scala DSL).
I still advocate for Breeze on the basis that it's a Scala project in the
data / ML community, and I believe it's a great base project that the Scala
community should support. Yes, it may need some bug fixes and performance
enhancements, but with netlib-java at its core it should be capable of good
performance. If effort is going to be put into Spark sparse matrix code, why
not have everyone involved put that effort into improving Breeze and making
it Scala's numpy? As evidenced above, there may well be some easy performance
wins to be had, and I'm sure @dlwh will help cut a release when Spark wants
to merge.
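For reference, here is a minimal sketch of the kind of dense/sparse operations
under discussion, written against Breeze (operator spellings and exact
behaviour vary with the Breeze version, and BLAS dispatch depends on
netlib-java natives being available):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, SparseVector}

object BreezeSketch extends App {
  // Dense vectors and a 2x3 dense matrix
  val x = DenseVector(1.0, 2.0, 3.0)
  val y = DenseVector(4.0, 5.0, 6.0)
  val A = DenseMatrix((1.0, 0.0, 2.0),
                      (0.0, 3.0, 0.0))

  val d  = x dot y    // dot product
  val ew = x *:* y    // element-wise product (spelled :* in older Breeze releases)
  val Ax = A * x      // mat-vec, dispatched to netlib-java BLAS where available

  // Sparse vector: only the non-zero entries are stored
  val s = SparseVector.zeros[Double](3)
  s(0) = 1.0
  s(2) = 5.0
  val sd = s dot x

  println(s"dot=$d elementwise=$ew Ax=$Ax sparseDot=$sd")
}
```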
I again raise the point that maintaining a lot of (any?) matrix algebra
code should not be a goal for Spark MLlib. While mahout-math is a highly
commendable project, I think the maintenance burden such a library would
place on the project and its committers is clear.
If Breeze is not an option because of implicit-related slowness etc., and
mahout-math is not an option for whatever reason, then how about MTJ
(https://github.com/fommil/matrix-toolkits-java)? It has been updated a lot
recently, is based on netlib-java with native support, and outperforms JBLAS.
Its pure-Java performance is decent and it could be wrapped in a lightweight
DSL (like Dmitriy's mahout-math effort). The API is decent, and it has sparse
matrix and vector support.
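To make the "lightweight DSL" idea concrete, here is a rough sketch of what a
thin Scala wrapper over MTJ could look like. The operator names and the
wrapper itself are purely illustrative, not a proposed API; the underlying
calls (Vector.dot, Vector.add, Matrix.mult) are from MTJ's
no.uib.cipr.matrix package:

```scala
import no.uib.cipr.matrix.{DenseMatrix, DenseVector, Vector => MTJVector}
import no.uib.cipr.matrix.sparse.SparseVector

// Illustrative operator sugar over MTJ's Java API -- not a proposed design
object MTJDsl {
  implicit class RichVector(v: MTJVector) {
    // non-mutating addition (MTJ's add mutates the receiver in place)
    def +(other: MTJVector): MTJVector = v.copy().add(other)
  }
  implicit class RichMatrix(m: DenseMatrix) {
    def *(x: MTJVector): MTJVector = {
      val y = new DenseVector(m.numRows())
      m.mult(x, y)  // y := A * x, BLAS-backed via netlib-java when natives are present
      y
    }
  }
}

object MTJDslExample extends App {
  import MTJDsl._

  val x = new DenseVector(Array(1.0, 2.0, 3.0))
  val A = new DenseMatrix(Array(Array(1.0, 0.0, 2.0), Array(0.0, 3.0, 0.0)))

  val s = new SparseVector(3)  // sparse vector of logical length 3
  s.set(0, 1.0)
  s.set(2, 5.0)

  println(s"sparse dot dense = ${s.dot(x)}")  // dot is already on MTJ's Vector interface
  println(s"x + x            = ${x + x}")
  println(s"A * x            = ${A * x}")
}
```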
For distributed ML, all that is really needed 90% of the time is dense
vectors, dense matrices and sparse vectors (with dot products, element-wise
operations, solvers for things like ALS, and possibly the odd matrix
multiply). It is true that the missing sparse features are fairly
lightweight, so ultimately it doesn't much matter which library is chosen, as
long as it's consistent, reasonably performant, and its APIs are as clean and
simple as possible.
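Just to illustrate how small that surface area actually is, something along
these lines (names entirely hypothetical, not a design proposal) would cover
most of what the current algorithms need, whichever library ends up backing
it:

```scala
// Hypothetical minimal linear algebra surface for MLlib-style algorithms;
// names and signatures are illustrative only.
trait MLVector {
  def size: Int
  def apply(i: Int): Double
  def dot(other: MLVector): Double
  def timesElementwise(other: MLVector): MLVector
}

trait MLMatrix {
  def numRows: Int
  def numCols: Int
  def multiply(x: MLVector): MLVector   // the occasional dense mat-vec step
}

// ALS-style algorithms mainly need small dense positive-definite solves
trait Solver {
  def solvePositiveDefinite(a: MLMatrix, b: MLVector): MLVector
}
```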