Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35433131 Thanks all for the suggestions! @srowen @giyengar I updated the small benchmark suite to include commons-math3. It seems to me commons-math3 has couple design issues. First of all, its sparse implementation is based on a primitive-typed hash map, which is not efficient for most linear algebra operations. Secondly, it doesn't support in-place vector operations, e.g., BLAS's axpy. Both account to the performance drop in the benchmarks (see the results attached). Together with the fact that its sparse implementation is deprecated, I wouldn't recommend using commons-math3 as the underlying linear algebra package of mllib. For mahout-math, it is clear in the PR code that we need some hacks to avoiding copying data around and it doesn't support native BLAS/LAPACK. So we need to wrap mathout-math and jblas or netlib-java together to provide a full-speed linear algebra service, which breeze already provides. @dlwh Thanks for making a quick release of breeze-0.6.1! It is the best overall performer in the benchmarks. I'm okay to make the switch after we clear the license issues (see the license part). # Benchmarks I tested only three operations: 1. dense SVD, which is used in PCA and also represents the performance of other matrix factorizations, etc. 2. sparse matrix times dense matrix, which is used in gradient-based methods (training multiple models together), etc. 3. dense vector plus sparse vector, which is used in KMeans, normalization, etc. This is certainly not a complete benchmark suite. Remember this PR is for finding a underlying linear algebra package for mllib's sparse data support. For simplicity, I didn't include test details. For the benchmark code, please go to https://github.com/mengxr/linalg-test ## Dense SVD breeze depends on netlib-all, which includes native libraries. jblas also packs native libraries, but it seems the performance is not as good as breeze/netlib-java. @fommil @mikiobraun Do you mind sharing which BLAS/LAPACK implementation you chose to make those native libraries and whether you enabled multi-threading? Thanks! ~~~ jblas: 685.5577ms breeze: 135.0402ms mahout: 2626.1641ms commons: 2151.0861ms ~~~ ## Sparse matrix times dense matrix The `barebone` implementation is operating directly on primitive arrays. ~~~ barebone: 20.1036ms breeze: 26.4364ms mahout: 2562.3702ms commons: 56.1518ms ~~~ ## Dense vector plus sparse vector ~~~ barebone: 0.033ms breeze: 0.037ms mahout: 0.075ms commons: 25.376ms ~~~ # breeze and netlib-java license The following is the dependency graph of breeze-0.6.1. @fommil Could you confirm the license of netlib-all, netlib-core, and the native libraries? The jniloader is distributed under LGPL. Is it possible to change it to a commercial-friendly license such as Apache? ~~~ +-org.scalanlp:breeze_2.10:0.6.1 Apache 2.0 +-com.github.fommil.netlib:all:1.1.2 Same to netlib-java? | +-com.github.fommil.netlib:core:1.1.2 Same to netlib-java? | +-com.github.fommil.netlib:netlib-native_ref-linux-x86_64:1.1 Same to netlib-java? | | +-com.github.fommil.netlib:native_ref-java:1.1 Same to netlib-java? | | +-com.github.fommil:jniloader:1.1 LGPL, is it okay to change it to Apache? | | | +-net.sourceforge.f2j:arpack_combined_all:0.1 University of Tennessee License | +-com.github.rwl:jtransforms:2.4.0 MPL/LGPL/GPL | +-junit:junit:4.8.2 | +-com.thoughtworks.paranamer:paranamer:2.2 +-com.typesafe:scalalogging-slf4j_2.10:1.0.1 | +-org.scala-lang:scala-reflect:2.10.0 (evicted by: 2.10.3) | +-org.scala-lang:scala-reflect:2.10.3 | | +-org.scala-lang:scala-library:2.10.3 | | | +-org.slf4j:slf4j-api:1.7.2 (evicted by: 1.7.5) | +-org.slf4j:slf4j-api:1.7.5 | +-net.sf.opencsv:opencsv:2.3 Apache 2.0 +-org.apache.commons:commons-math3:3.2 +-org.scala-lang:scala-library:2.10.3 +-org.scalanlp:breeze-macros_2.10:0.2 +-org.scala-lang:scala-library:2.10.3 +-org.scala-lang:scala-reflect:2.10.3 +-org.scala-lang:scala-library:2.10.3 ~~~
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. To do so, please top-post your response. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---