Github user mengxr commented on the pull request:
https://github.com/apache/incubator-spark/pull/575#issuecomment-35433131
Thanks all for the suggestions!
@srowen @giyengar I updated the small benchmark suite to include
commons-math3. It seems to me that commons-math3 has a couple of design issues.
First, its sparse implementation is based on a primitive-typed hash map, which
is not efficient for most linear algebra operations. Second, it doesn't support
in-place vector operations, e.g., BLAS's axpy. Both account for the performance
drop in the benchmarks (see the results attached). Together with the fact that
its sparse implementation is deprecated, I wouldn't recommend using
commons-math3 as the underlying linear algebra package of mllib.
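To make the in-place point concrete, here is a minimal sketch of a sparse axpy (`y += a * x`) that writes straight into a dense primitive array. The class name `SparseAxpy` and the parallel index/value layout for the sparse vector are assumptions made for illustration. They are not taken from the benchmark code or from any of the libraries discussed.

```java
/** Illustrative only: in-place y += a * x, where x is a sparse vector. */
public final class SparseAxpy {
    private SparseAxpy() {}

    /**
     * Adds a * x to y in place.
     *
     * @param a        scalar multiplier
     * @param xIndices positions of the nonzero entries of x (assumed layout)
     * @param xValues  values of the nonzero entries of x, parallel to xIndices
     * @param y        dense vector, updated in place
     */
    public static void axpy(double a, int[] xIndices, double[] xValues, double[] y) {
        if (xIndices.length != xValues.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        // Touch only the nonzero entries of x; no temporary vector is allocated.
        for (int k = 0; k < xIndices.length; k++) {
            y[xIndices[k]] += a * xValues[k];
        }
    }
}
```

With `a = 1.0` this is the "dense vector plus sparse vector" case from the benchmarks. A library without in-place operations has to allocate and fill a new result vector on every call, which is costly when the sparse operand has only a few nonzeros.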
For mahout-math, it is clear in the PR code that we need some hacks to
avoid copying data around, and it doesn't support native BLAS/LAPACK. So we
would need to wrap mahout-math and jblas or netlib-java together to provide a
full-speed linear algebra service, which breeze already provides.
@dlwh Thanks for making a quick release of breeze-0.6.1! It is the best
overall performer in the benchmarks. I'm okay with making the switch once we
clear up the license issues (see the license part).
# Benchmarks
I tested only three operations:
1. dense SVD, which is used in PCA and also represents the performance of
other matrix factorizations, etc.
2. sparse matrix times dense matrix, which is used in gradient-based
methods (training multiple models together), etc.
3. dense vector plus sparse vector, which is used in KMeans, normalization,
etc.
This is certainly not a complete benchmark suite. Remember this PR is for
finding an underlying linear algebra package for mllib's sparse data support.
For simplicity, I didn't include test details. For the benchmark code, please
go to https://github.com/mengxr/linalg-test
## Dense SVD
breeze depends on netlib-all, which includes native libraries. jblas also
packs native libraries, but it seems the performance is not as good as
breeze/netlib-java. @fommil @mikiobraun Do you mind sharing which BLAS/LAPACK
implementation you chose to make those native libraries and whether you enabled
multi-threading? Thanks!
~~~
jblas: 685.5577ms
breeze: 135.0402ms
mahout: 2626.1641ms
commons: 2151.0861ms
~~~
## Sparse matrix times dense matrix
The `barebone` implementation operates directly on primitive arrays.
~~~
barebone: 20.1036ms
breeze: 26.4364ms
mahout: 2562.3702ms
commons: 56.1518ms
~~~
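For readers who want a picture of what "directly on primitive arrays" can look like, the following is a minimal sketch of a sparse-times-dense product. It assumes the sparse matrix is stored in compressed sparse row (CSR) form and the dense matrices are row-major. Both layouts, and the class name `CsrTimesDense`, are assumptions for illustration; the actual benchmark code at the linked repository may use a different storage format.

```java
/** Illustrative only: C = A * B with A in CSR form and B, C dense row-major. */
public final class CsrTimesDense {
    private CsrTimesDense() {}

    /**
     * @param m      number of rows of A
     * @param rowPtr CSR row pointers of A, length m + 1
     * @param colIdx column index of each stored entry of A
     * @param values value of each stored entry of A
     * @param b      dense matrix B, row-major, with k columns
     * @param k      number of columns of B (and of the result)
     * @return dense m-by-k result, row-major
     */
    public static double[] multiply(int m, int[] rowPtr, int[] colIdx, double[] values,
                                    double[] b, int k) {
        double[] c = new double[m * k];
        for (int i = 0; i < m; i++) {
            int cRow = i * k;
            // For each stored entry A[i][j], add A[i][j] * B[j][:] to C[i][:].
            for (int p = rowPtr[i]; p < rowPtr[i + 1]; p++) {
                double v = values[p];
                int bRow = colIdx[p] * k;
                for (int j = 0; j < k; j++) {
                    c[cRow + j] += v * b[bRow + j];
                }
            }
        }
        return c;
    }
}
```

The inner loop walks contiguous memory in both `b` and `c`, and the work is proportional to the number of stored entries times `k`, not to the full dense size of the sparse matrix.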
## Dense vector plus sparse vector
~~~
barebone: 0.033ms
breeze: 0.037ms
mahout: 0.075ms
commons: 25.376ms
~~~
# breeze and netlib-java license
The following is the dependency graph of breeze-0.6.1. @fommil Could you
confirm the license of netlib-all, netlib-core, and the native libraries? The
jniloader is distributed under LGPL. Is it possible to change it to a
commercial-friendly license such as Apache?
~~~
+-org.scalanlp:breeze_2.10:0.6.1  Apache 2.0
+-com.github.fommil.netlib:all:1.1.2  Same as netlib-java?
| +-com.github.fommil.netlib:core:1.1.2  Same as netlib-java?
| +-com.github.fommil.netlib:netlib-native_ref-linux-x86_64:1.1  Same as netlib-java?
| | +-com.github.fommil.netlib:native_ref-java:1.1  Same as netlib-java?
| | +-com.github.fommil:jniloader:1.1  LGPL, is it okay to change it to Apache?
| |
| +-net.sourceforge.f2j:arpack_combined_all:0.1  University of Tennessee License
|
+-com.github.rwl:jtransforms:2.4.0  MPL/LGPL/GPL
| +-junit:junit:4.8.2
|
+-com.thoughtworks.paranamer:paranamer:2.2
+-com.typesafe:scalalogging-slf4j_2.10:1.0.1
| +-org.scala-lang:scala-reflect:2.10.0 (evicted by: 2.10.3)
| +-org.scala-lang:scala-reflect:2.10.3
| | +-org.scala-lang:scala-library:2.10.3
| |
| +-org.slf4j:slf4j-api:1.7.2 (evicted by: 1.7.5)
| +-org.slf4j:slf4j-api:1.7.5
|
+-net.sf.opencsv:opencsv:2.3  Apache 2.0
+-org.apache.commons:commons-math3:3.2
+-org.scala-lang:scala-library:2.10.3
+-org.scalanlp:breeze-macros_2.10:0.2
+-org.scala-lang:scala-library:2.10.3
+-org.scala-lang:scala-reflect:2.10.3
+-org.scala-lang:scala-library:2.10.3
~~~