Github user mengxr commented on the pull request:

    https://github.com/apache/incubator-spark/pull/575#issuecomment-35433131
  
    Thanks all for the suggestions! 
    
    @srowen @giyengar I updated the small benchmark suite to include 
commons-math3. It seems to me commons-math3 has couple design issues. First of 
all, its sparse implementation is based on a primitive-typed hash map, which is 
not efficient for most linear algebra operations. Secondly, it doesn't support 
in-place vector operations, e.g., BLAS's axpy. Both account to the performance 
drop in the benchmarks (see the results attached). Together with the fact that 
its sparse implementation is deprecated, I wouldn't recommend using 
commons-math3 as the underlying linear algebra package of mllib.
    
    For mahout-math, it is clear in the PR code that we need some hacks to 
avoiding copying data around and it doesn't support native BLAS/LAPACK. So we 
need to wrap mathout-math and jblas or netlib-java together to provide a 
full-speed linear algebra service, which breeze already provides.
    
    @dlwh Thanks for making a quick release of breeze-0.6.1! It is the best 
overall performer in the benchmarks. I'm okay to make the switch after we clear 
the license issues (see the license part).
    
    # Benchmarks
    
    I tested only three operations:
    
    1. dense SVD, which is used in PCA and also represents the performance of 
other matrix factorizations, etc.
    2. sparse matrix times dense matrix, which is used in gradient-based 
methods (training multiple models together), etc.
    3. dense vector plus sparse vector, which is used in KMeans, normalization, 
etc.
    
    This is certainly not a complete benchmark suite. Remember this PR is for 
finding a underlying linear algebra package for mllib's sparse data support. 
For simplicity, I didn't include test details. For the benchmark code, please 
go to https://github.com/mengxr/linalg-test
    
    ## Dense SVD
    
    breeze depends on netlib-all, which includes native libraries. jblas also 
packs native libraries, but it seems the performance is not as good as 
breeze/netlib-java. @fommil @mikiobraun Do you mind sharing which BLAS/LAPACK 
implementation you chose to make those native libraries and whether you enabled 
multi-threading? Thanks!
    
    ~~~
    jblas:    685.5577ms
    breeze:   135.0402ms
    mahout:  2626.1641ms
    commons: 2151.0861ms
    ~~~
    
    ## Sparse matrix times dense matrix
    
    The `barebone` implementation is operating directly on primitive arrays. 
    
    ~~~
    barebone:  20.1036ms
    breeze:    26.4364ms
    mahout:  2562.3702ms
    commons:   56.1518ms
    ~~~
    
    ## Dense vector plus sparse vector
    
    ~~~
    barebone:   0.033ms
    breeze:     0.037ms
    mahout:     0.075ms
    commons:   25.376ms
    ~~~
    
    # breeze and netlib-java license
    
    The following is the dependency graph of breeze-0.6.1. @fommil Could you 
confirm the license of netlib-all, netlib-core, and the native libraries? The 
jniloader is distributed under LGPL. Is it possible to change it to a 
commercial-friendly license such as Apache? 
    
    ~~~
    +-org.scalanlp:breeze_2.10:0.6.1                                   Apache 
2.0
      +-com.github.fommil.netlib:all:1.1.2                             Same to 
netlib-java?
      | +-com.github.fommil.netlib:core:1.1.2                          Same to 
netlib-java?
      | +-com.github.fommil.netlib:netlib-native_ref-linux-x86_64:1.1  Same to 
netlib-java?
      | | +-com.github.fommil.netlib:native_ref-java:1.1               Same to 
netlib-java?
      | |   +-com.github.fommil:jniloader:1.1                          LGPL, is 
it okay to change it to Apache?
      | |   
      | +-net.sourceforge.f2j:arpack_combined_all:0.1                  
University of Tennessee License
      | 
      +-com.github.rwl:jtransforms:2.4.0                               
MPL/LGPL/GPL
      | +-junit:junit:4.8.2
      | 
      +-com.thoughtworks.paranamer:paranamer:2.2
      +-com.typesafe:scalalogging-slf4j_2.10:1.0.1
      | +-org.scala-lang:scala-reflect:2.10.0 (evicted by: 2.10.3)
      | +-org.scala-lang:scala-reflect:2.10.3
      | | +-org.scala-lang:scala-library:2.10.3
      | | 
      | +-org.slf4j:slf4j-api:1.7.2 (evicted by: 1.7.5)
      | +-org.slf4j:slf4j-api:1.7.5
      | 
      +-net.sf.opencsv:opencsv:2.3                                     Apache 
2.0
      +-org.apache.commons:commons-math3:3.2
      +-org.scala-lang:scala-library:2.10.3
      +-org.scalanlp:breeze-macros_2.10:0.2
        +-org.scala-lang:scala-library:2.10.3
        +-org.scala-lang:scala-reflect:2.10.3
          +-org.scala-lang:scala-library:2.10.3
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. To do so, please top-post your response.
If your project does not have this feature enabled and wishes so, or if the
feature is enabled but not working, please contact infrastructure at
infrastruct...@apache.org or file a JIRA ticket with INFRA.
---

Reply via email to