Re: GPU Acceleration of Spark Logistic Regression and Other MLlib libraries

2016-01-22 Thread Sam Halliday
Hi all,

(I'm author of netlib-java)

Interesting to see this discussion come to life again.

JNI is quite limiting: pinning (or critical array access) essentially
disables the GC for the whole JVM for the duration of the native call. I
can justify this for CPU-heavy tasks because, frankly, there are not going to
be any free cycles to do anything other than BLAS while a dense matrix is
being crunched. For GPU tasks, you could get into some hairy problems and
hit OOM just by doing basic work.

The other big problem with JNI is that this memory is either on the heap
(and subject to the whims of GC, large pause times in tenured cleanups) or
is a lightweight reference to a huge off-heap object and the GC might never
clean it up. There are hacks around this, but none are satisfactory.

More at my talk at Scala Exchange http://fommil.github.io/scalax14/#/

I have a roadmap to move netlib-java over to ByteBuffers, as they solve all
the problems I have seen. It would effectively be a rewrite (down to the
Fortran-to-JVM compiler) and would change the Java API in a systematic way,
but it could support BLAS-like GPU backends at the same time. I would be
willing to migrate all the major libraries that are using netlib-java as
part of this effort.
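To make the off-heap point concrete, here is a minimal Java sketch (this is not netlib-java's actual API; the class name and layout are invented for illustration) of how a direct ByteBuffer keeps vector data outside the GC-managed heap, where a native BLAS or GPU routine could read it without pinning a heap array:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

// Illustrative sketch only: a dense vector held in a direct ByteBuffer
// lives outside the Java heap, so a native BLAS/GPU call can read it
// without the JVM pinning a heap array or copying the data across.
public class OffHeapVector {
    private final DoubleBuffer data;

    public OffHeapVector(int n) {
        // allocateDirect gives off-heap memory; native byte order matches
        // what a Fortran/C BLAS implementation expects.
        this.data = ByteBuffer.allocateDirect(n * Double.BYTES)
                              .order(ByteOrder.nativeOrder())
                              .asDoubleBuffer();
    }

    public void set(int i, double v) { data.put(i, v); }

    public double get(int i) { return data.get(i); }

    public static void main(String[] args) {
        OffHeapVector v = new OffHeapVector(3);
        v.set(0, 1.0); v.set(1, 2.0); v.set(2, 3.0);
        double sum = v.get(0) + v.get(1) + v.get(2);
        System.out.println(sum); // prints 6.0
    }
}
```

The design trade-off this illustrates: a direct buffer is neither subject to GC compaction pauses nor a lightweight heap reference to a huge native allocation, which are the two failure modes described above.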

However, I have no commercial incentive to perform this work, so I would be
seeking funding to do it. I will not be starting anything without funding.
Please contact me if you would be a willing stakeholder. I estimate it as a
6 month project: all major platforms, along with a CI build making it easy
to update, with testing.
On 22 Jan 2016 3:48 p.m., "Rajesh Bordawekar" <bor...@us.ibm.com> wrote:

> Hi Alexander,
>
> We, at IBM Watson Research, are also working on GPU acceleration of Spark,
> but we have taken an approach that is complementary to Ishizaki-san's
> direction. Our focus is to develop runtime infrastructure that enables
> multi-node, multi-GPU exploitation in the Spark environment. The key goal of
> our approach is to enable **transparent** invocation of GPUs, without
> requiring the user to change a single line of code. Users may need to add a
> Spark configuration flag to direct the system on GPU usage (the exact
> semantics are currently being debated).
>
> Currently, we have LBFGS-based Logistic Regression model building and
> prediction implemented on a multi-node, multi-GPU environment (the model
> building is done on a single node). We are using our own implementation of
> LBFGS as the baseline for the GPU code. The GPU code uses cuBLAS (I presume
> that's what you meant by NVBLAS) wherever possible, and indeed, we arrange
> the execution so that cuBLAS operates on larger matrices. We are using JNI
> to invoke CUDA from Scala, and we have not seen any performance degradation
> due to JNI-based invocation.
>
> We are in the process of implementing an ADMM-based distributed optimization
> function, which would build the model in parallel (it currently uses LBFGS as
> its individual kernel, which can be replaced by any other kernel as well). The
> ADMM function would also be accelerated in a multi-node, multi-GPU
> environment. We are planning to shift to Datasets/DataFrames soon and to
> support other Logistic Regression kernels, such as Quasi-Newton-based
> approaches.
>
> We have also enabled the Spark MLlib ALS algorithm to run on a multi-node,
> multi-GPU system (the ALS code also uses cuBLAS/cuSPARSE). Next, we will be
> covering additional functions for GPU exploitation, e.g., word2vec (CBOW
> and Skip-gram with Negative Sampling), GloVe, etc.
>
> Regarding comparison to BIDMat/BIDMach, we have studied it in detail and
> have been using it as a guide for integrating GPU code with Scala. However,
> I think comparing end-to-end results would not be appropriate, as we are
> affected by Spark's runtime costs; specifically, a single Spark function that
> converts an RDD to arrays is very expensive and impacts our end-to-end
> performance severely (from a 200x+ gain for the GPU kernel alone down to
> 25x+ once the Spark library function is included). In contrast, BIDMach has
> a very light and efficient layer between its GPU kernel and the user program.
>
> Finally, we are building a comprehensive multi-node, multi-GPU resource
> management and discovery component in Spark. We are planning to augment the
> existing Spark resource management UI to include GPU resources.
>
> Please let me know if you have questions/comments! I will be attending
> Spark Summit East and can meet in person to discuss any details.
>
> -regards,
> Rajesh
>
>

Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Sam Halliday
I'm not at all surprised ;-) I fully expect the GPU performance to get
better automatically as the hardware improves.

Netlib natives still need to be shipped separately. I'd also oppose any
move to make OpenBLAS the default - it's not always better, and I think
natives really need DevOps buy-in. It's not the right solution for
everybody.
On 26 Mar 2015 01:23, Evan R. Sparks evan.spa...@gmail.com wrote:

 Yeah, much more reasonable - nice to know that we can get full GPU
 performance from breeze/netlib-java - meaning there's no compelling
 performance reason to switch out our current linear algebra library (at
 least as far as this benchmark is concerned).

 Instead, it looks like a user guide for configuring Spark/MLlib to use the
 right BLAS library will get us most of the way there. Or, would it make
 sense to finally ship openblas compiled for some common platforms (64-bit
 linux, windows, mac) directly with Spark - hopefully eliminating the jblas
 warnings once and for all for most users? (Licensing is BSD) Or am I
 missing something?

 On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:

 As everyone suggested, the results were too good to be true, so I
 double-checked them. It turns out that nvblas did not do the multiplication,
 due to the NVBLAS_TILE_DIM parameter from nvblas.conf, and returned a zero
 matrix. My previously posted results with nvblas measured matrix copying
 only. The default NVBLAS_TILE_DIM==2048 is too big for my graphics
 card/matrix size. I hand-picked other values that worked. As a result,
 netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a
 how-to for nvblas configuration.
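For reference, NVBLAS reads these settings from nvblas.conf. The fragment below is an editor's sketch of the relevant keys from the NVBLAS documentation; the library path and tile value are illustrative assumptions, not the values Alexander actually settled on:

```
# Sketch of an nvblas.conf (illustrative values, see NVBLAS docs)
# Log file for NVBLAS messages
NVBLAS_LOGFILE nvblas.log
# CPU BLAS that NVBLAS falls back to for small operations (path assumed)
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
# Use every GPU in the machine
NVBLAS_GPU_LIST ALL
# Tile size used when splitting matrices across the GPU(s);
# the 2048 default was too large for the card above, so a smaller
# value is shown here as an example
NVBLAS_TILE_DIM 1024
```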


 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing



 -Original Message-
 From: Ulanov, Alexander
 Sent: Wednesday, March 25, 2015 2:31 PM
 To: Sam Halliday
 Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks;
 jfcanny
 Subject: RE: Using CUDA within Spark / boosting linear algebra

 Hi again,

 I finally managed to use nvblas within Spark+netlib-java. It has
 exceptional performance for big matrices with Double, faster than
 BIDMat-cuda with Float. But for smaller matrices, if you copy them
 to/from the GPU, OpenBLAS or MKL might be a better choice. This matches the
 original nvblas presentation at GPU Tech Conf 2013 (slide 21):
 http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

 My results:

 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

 Just in case: these tests are not meant to generalize the performance of
 the different libraries. I just want to pick the library that performs dense
 matrix multiplication best for my task.

 P.S. My previous issue with nvblas was the following: it exposes the Fortran
 blas functions, while netlib-java uses the C cblas functions. So, one
 needs a cblas shared library to use nvblas through netlib-java. Fedora does
 not ship cblas (Debian and Ubuntu do), so I needed to compile it. I
 could not use the cblas from ATLAS or OpenBLAS because they link to their
 own implementations and not to the Fortran blas.

 Best regards, Alexander

 -Original Message-
 From: Ulanov, Alexander
 Sent: Tuesday, March 24, 2015 6:57 PM
 To: Sam Halliday
 Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
 Subject: RE: Using CUDA within Spark / boosting linear algebra

 Hi,

 I am trying to use nvblas with netlib-java from Spark. The nvblas functions
 should replace the current blas function calls after setting LD_PRELOAD, as
 suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any
 changes to netlib-java. It seems to work for a simple Java example, but I
 cannot make it work with Spark. I run the following:
 export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
 env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
 In nvidia-smi I observe that Java is set to use the GPU:

 +------------------------------------------------------------------+
 | Processes:                                            GPU Memory |
 |  GPU   PID  Type  Process name                             Usage |
 |==================================================================|
 |    0  8873     C  bash                                     39MiB |
 |    0  8910     C  /usr/lib/jvm/java-1.7.0/bin/java         39MiB |
 +------------------------------------------------------------------+

 In the Spark shell I do a matrix multiplication and see the following:
 15/03/25 06:48:01 INFO JniLoader: successfully loaded
 /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
 So I am sure that netlib-native is loaded and cblas is supposedly used.
 However, the matrix multiplication still executes on the CPU, since I see
 16% CPU usage and 0% GPU usage. I also checked different matrix sizes, from
 100x100 to 12000x12000.

 Could you suggest why LD_PRELOAD might not affect the Spark shell?

Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Sam Halliday
Btw, OpenBLAS requires GPL runtime binaries, which are typically considered
system libraries (these fall under something similar to the Java
classpath-exception rule)... so it's basically impossible to distribute
OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in
Spark right now to clear up something of this nature.

On a more technical level, I'd recommend watching my talk at ScalaX which
explains in detail why high performance only comes from machine optimised
binaries, which requires DevOps buy-in (and, I'd recommend using MKL anyway
on the CPU, not OpenBLAS).

On an even deeper level, using natives has consequences for JIT and GC which
aren't suitable for everybody, and we'd really like people to go into that
with their eyes wide open.

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Sam Halliday
That would be a difficult task that would only benefit users of
netlib-java. MultiBLAS is easily implemented (although a lot of
boilerplate) and benefits all BLAS users on the system.

If anyone knows of a funding route for it, I'd love to hear from them,
because it's too much work for me to take on at the moment as hobby.
On 25 Mar 2015 22:16, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Sam,

 would it be easier to hack netlib-java to allow multiple (configurable)
 library contexts? And so enable 3rd-party configurations and optimizers to
 make their own choices until then?

 On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday sam.halli...@gmail.com
 wrote:

 Yeah, MultiBLAS... it is dynamic.

 Except, I haven't written it yet :-P
 On 25 Mar 2015 22:06, Ulanov, Alexander alexander.ula...@hp.com
 wrote:

  Netlib knows nothing about the GPU (or CPU); it just uses the cblas symbols
 from the provided libblas.so.3 library at runtime. So, you can switch
 at runtime by providing another library. Sam, please suggest if there
 is another way.



 *From:* Dmitriy Lyubimov [mailto:dlie...@gmail.com]
 *Sent:* Wednesday, March 25, 2015 2:55 PM
 *To:* Ulanov, Alexander
 *Cc:* Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph
 Bradley; Evan R. Sparks; jfcanny
 *Subject:* Re: Using CUDA within Spark / boosting linear algebra



 Alexander,



 Does using netlib imply that one cannot switch between CPU and GPU blas
 alternatives at will at the same time? The choice is always determined by
 linking alternatives to libblas.so, right?




Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Sam Halliday
If you write it up I'll add it to the netlib-java wiki :-)

BTW, does it automatically flip between CPU/GPU? I've a project called
MultiBLAS which was going to do this; it should be easy (but boring to
write).
On 25 Mar 2015 22:00, Evan R. Sparks evan.spa...@gmail.com wrote:

 Alex - great stuff, and the nvblas numbers are pretty remarkable (almost
 too good... did you check the results for correctness? Also, is it
 possible that the unified memory model of nvblas is somehow hiding PCI
 transfer time?)

 This last bit (getting nvblas + netlib-java to play together) sounds like
 it's non-trivial and took you a while to figure out! Would you mind posting
 a gist or something of the shell scripts/exports you used to make
 this work? I can imagine it being highly useful for others in the future.

 Thanks!
 Evan

RE: Using CUDA within Spark / boosting linear algebra

2015-03-09 Thread Sam Halliday
Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on
various pieces of hardware...
On 9 Mar 2015 21:08, Ulanov, Alexander alexander.ula...@hp.com wrote:

 Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added a
 comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see
 support for Double in the current source code) and did the test with BIDMat
 and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.


 https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

 Best regards, Alexander

 -Original Message-
 From: Sam Halliday [mailto:sam.halli...@gmail.com]
 Sent: Tuesday, March 03, 2015 1:54 PM
 To: Xiangrui Meng; Joseph Bradley
 Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 BTW, is anybody on this list going to the London Meetup in a few weeks?


 https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

 Would be nice to meet other people working on the guts of Spark! :-)


 Xiangrui Meng men...@gmail.com writes:

  Hey Alexander,
 
  I don't quite understand the part where netlib-cublas is about 20x
  slower than netlib-openblas. What is the overhead of using a GPU BLAS
  with netlib-java?
 
  CC'ed Sam, the author of netlib-java.
 
  Best,
  Xiangrui
 
  On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com
 wrote:
  Better documentation for linking would be very helpful!  Here's a JIRA:
  https://issues.apache.org/jira/browse/SPARK-6019
 
 
  On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
  evan.spa...@gmail.com
  wrote:
 
  Thanks for compiling all the data and running these benchmarks,
  Alex. The big takeaways here can be seen with this chart:

  https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

  1) A properly configured GPU matrix multiply implementation (e.g.
  BIDMat+GPU) can provide a substantial (but less than an order of magnitude)
  benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
  netlib-java+openblas-compiled).
  2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
  worse than a well-tuned CPU implementation, particularly for larger
  matrices (netlib-f2jblas or netlib-ref). This is not to pick on netlib -
  this basically agrees with the author's own benchmarks (
  https://github.com/fommil/netlib-java)
 
  I think that most of our users are in a situation where using GPUs
  may not be practical - although we could consider having a good GPU
  backend available as an option. However, *ALL* users of MLlib could
  benefit (potentially tremendously) from using a well-tuned CPU-based
  BLAS implementation. Perhaps we should consider updating the MLlib
  guide with a more complete section on enabling high-performance
  binaries on OSX and Linux? Or better, figure out a way for the
  system to fetch these automatically.
 
  - Evan
 
 
 
  On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
  alexander.ula...@hp.com wrote:
 
  Just to summarize this thread, I was finally able to make all the
  performance comparisons that we discussed. It turns out that:

  BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
  netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas

  Below is the link to the spreadsheet with full results.

  https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

  One thing still needs exploration: does BIDMat-cublas perform
  copying to/from the machine’s RAM?
 
  -Original Message-
  From: Ulanov, Alexander
  Sent: Tuesday, February 10, 2015 2:12 PM
  To: Evan R. Sparks
  Cc: Joseph Bradley; dev@spark.apache.org
  Subject: RE: Using CUDA within Spark / boosting linear algebra
 
   Thanks, Evan! It seems that the ticket was marked as duplicate, though
   the original one discusses a slightly different topic. I was able to
   link netlib with the MKL from the BIDMat binaries. Indeed, MKL is
   statically linked inside a 60MB library.
 
   |A*B size | BIDMat MKL | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
   |100x100*100x100 | 0.00205596 | 0.000381 | 0.03810324 | 0.002556 |
   |1000x1000*1000x1000 | 0.018320947 | 0.038316857 | 0.51803557 | 1.638475459 |
   |10000x10000*10000x10000 | 23.78046632 | 32.94546697 | 445.0935211 | 1569.233228 |
 
   It turns out that the pre-compiled MKL is faster than the precompiled
   OpenBlas on my machine. Probably I’ll add two more columns with
   locally compiled OpenBlas and CUDA.
 
  Alexander
 
  From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
  Sent: Monday, February 09, 2015 6:06 PM
  To: Ulanov, Alexander
  Cc: Joseph Bradley; dev@spark.apache.org
   Subject: Re: Using CUDA within Spark / boosting linear algebra

Re: Using CUDA within Spark / boosting linear algebra

2015-03-03 Thread Sam Halliday
 with JNI overheads.

  Though, it might be interesting to link netlib-java with Intel MKL, as
  you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
  (netlib-java) would be interested in comparing their libraries.

 Best regards, Alexander

  From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
  Sent: Friday, February 06, 2015 5:58 PM

  To: Ulanov, Alexander
  Cc: Joseph Bradley; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 I would build OpenBLAS yourself, since good BLAS performance comes from
 getting cache sizes, etc. set up correctly for your particular hardware -
 this is often a very tricky process (see, e.g. ATLAS), but we found that on
 relatively modern Xeon chips, OpenBLAS builds quickly and yields
 performance competitive with MKL.

  To make sure the right library is getting used, you have to make sure
  it's first on the search path - export
  LD_LIBRARY_PATH=/path/to/blas (the directory containing the library) will
  do the trick here.

 For some examples of getting netlib-java setup on an ec2 node and some
 example benchmarking code we ran a while back, see:
 https://github.com/shivaram/matrix-bench

 In particular - build-openblas-ec2.sh shows you how to build the library
 and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
 the path setup and get that library picked up by netlib-java.

 In this way - you could probably get cuBLAS set up to be used by
 netlib-java as well.

 - Evan

 On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander 
 alexander.ula...@hp.commailto:alexander.ula...@hp.com wrote:
  Evan, could you elaborate on how to force BIDMat and netlib-java to load
  the right blas? For netlib, there are a few JVM flags, such as
  -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
  force it to use the Java implementation. I am not sure I understand how to
  force the use of a specific blas (not a specific wrapper for blas).

  Btw. I have installed OpenBlas (yum install openblas), so I suppose that
  netlib is using it.

  From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
  Sent: Friday, February 06, 2015 5:19 PM
  To: Ulanov, Alexander
  Cc: Joseph Bradley; dev@spark.apache.org

 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Getting breeze to pick up the right blas library is critical for
 performance. I recommend using OpenBLAS (or MKL, if you already have it).
 It might make sense to force BIDMat to use the same underlying BLAS library
 as well.

  On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander
  alexander.ula...@hp.com wrote:
 Hi Evan, Joseph

  I did a few matrix multiplication tests, and BIDMat seems to be ~10x faster
  than netlib-java+breeze (sorry for the weird table formatting):

  |A*B size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
  |100x100*100x100 | 0.00205596 | 0.03810324 | 0.002556 |
  |1000x1000*1000x1000 | 0.018320947 | 0.51803557 | 1.638475459 |
  |10000x10000*10000x10000 | 23.78046632 | 445.0935211 | 1569.233228 |

 Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
 Linux, Scala 2.11.

 Later I will run tests with CUDA. I need to install a new CUDA version for
 this purpose.

 Do you have any ideas why breeze+netlib with native BLAS is so much
 slower than BIDMat with MKL?

 Best regards, Alexander

 From: Joseph Bradley [mailto:jos...@databricks.com]
 Sent: Thursday, February 05, 2015 5:29 PM
 To: Ulanov, Alexander
 Cc: Evan R. Sparks; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Hi Alexander,

 Using GPUs with Spark would be very exciting.  Small comment: Concerning
 your question earlier about keeping data stored on the GPU rather than
 having to move it between main memory and GPU memory on each iteration, I
 would guess this would be critical to getting good performance.  If you
 could do multiple local iterations before aggregating results, then the
 cost of data movement to the GPU could be amortized (and I believe that is
 done in practice).  Having Spark be aware of the GPU and using it as
 another part of memory sounds like a much bigger undertaking.

 Joseph

 On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:
 Thank you for the explanation! I’ve watched the BIDMach presentation by John
 Canny and I am really inspired by his talk and his comparisons with Spark
 MLlib.

 I am very interested to find out which will be better within Spark: BIDMat
 or netlib-java with CPU or GPU natives. Could you suggest a fair way to
 benchmark them? Currently I do benchmarks on artificial neural networks in
 batch mode. While it is not a “pure” test of linear algebra, it involves
 some other things

Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Sam Halliday
Also, check the JNILoader output.

Remember: for netlib-java to use your system libblas, all you need to do is
set up libblas.so.3 the way any native application would expect.

I haven't ever used cuBLAS as a "real BLAS" implementation, so I'd be
interested to hear about this. Run 'ldd /usr/lib/libblas.so.3' to check
that all the runtime links are in order.

Btw, I have some DGEMM wrappers in my netlib-java performance module... and
I also planned to write more in MultiBLAS (until I mothballed the project
to wait for the hardware to catch up, which it probably has, and now I just
need a reason to look at it)
 On 27 Feb 2015 20:26, Xiangrui Meng men...@gmail.com wrote:

 Hey Sam,

 The running times are not big O estimates:

  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.

 I think there is something wrong with the netlib/cublas combination.
 Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
 interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
 through the CPU BLAS interface we need to use NVBLAS, which intercepts
 some Level 3 CPU BLAS calls (including GEMM). So we need to load
 nvblas.so first and then some CPU BLAS library in JNI. I wonder
 whether the setup was correct.
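The NVBLAS setup Xiangrui describes can be sketched as follows. This is a hedged sketch: the library paths are assumptions for a typical install, while the NVBLAS_CONFIG_FILE variable and the NVBLAS_CPU_BLAS_LIB / NVBLAS_GPU_LIST keywords come from NVIDIA's NVBLAS documentation.

```shell
# NVBLAS intercepts Level 3 CPU BLAS calls (e.g. GEMM) and routes them to the
# GPU; everything else falls back to a CPU BLAS declared in a config file.
cat > /tmp/nvblas.conf <<'EOF'
NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
NVBLAS_GPU_LIST ALL
EOF
export NVBLAS_CONFIG_FILE=/tmp/nvblas.conf
# Preload nvblas ahead of the CPU BLAS so intercepted calls reach the GPU
# (the CUDA path below is an assumption for a default layout; only echoed here):
echo 'LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so java -jar app.jar'
```

If the preload is missing or misconfigured, every call silently stays on the CPU BLAS, which would explain a netlib-cublas setup that never touches the GPU.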

 Alexander, could you check whether GPU is used in the netlib-cublas
 experiments? You can tell it by watching CPU/GPU usage.

 Best,
 Xiangrui

 On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday sam.halli...@gmail.com
 wrote:
  Don't use big O estimates, always measure. It used to work back in the
  days when double multiplication was a bottleneck. The computation cost is
  effectively free on both the CPU and GPU and you're seeing pure copying
  costs. Also, I'm dubious that cublas is doing what you think it is. Can
 you
  link me to the source code for DGEMM?
 
  I show all of this in my talk, with explanations, I can't stress enough
 how
  much I recommend that you watch it if you want to understand high
  performance hardware acceleration for linear algebra :-)
 
  On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:
 
  The copying overhead should be quadratic on n, while the computation
  cost is cubic on n. I can understand that netlib-cublas is slower than
  netlib-openblas on small problems. But I'm surprised to see that it is
  still 20x slower on 10000x10000. I did the following on a g2.2xlarge
  instance with BIDMat:
 
  val n = 10000
 
  val f = rand(n, n)
  flip; f*f; val rf = flop
 
  flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg =
 flop
 
  flip; g*g; val rgg = flop
 
  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.
 
  I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas
  path. But based on the result, the data copying overhead is definitely
  not as big as 20x at n = 10000.
 
  Best,
  Xiangrui
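Xiangrui's quadratic-vs-cubic point can be written out explicitly. A back-of-envelope sketch for an n x n double-precision GEMM; the bandwidth and throughput figures are illustrative assumptions, not measurements:

```latex
% copy:    3n^2 doubles over the PCIe bus (A and B in, C out)
% compute: ~2n^3 floating-point operations
\[
  \frac{T_{\text{copy}}}{T_{\text{compute}}}
  \;\approx\;
  \frac{3 n^2 \cdot 8\,\text{B} \,/\, \beta}{2 n^3 \,/\, \phi}
  \;=\; \frac{12\,\phi}{\beta\, n}
\]
% With illustrative beta = 6 GB/s (PCIe bandwidth) and phi = 1 TFLOP/s,
% the ratio at n = 10000 is ~0.2: copying costs a fifth of the compute
% time, nowhere near a 20x blow-up.
```

The ratio shrinks as 1/n, which is why a genuine GPU BLAS should win more and more clearly at large sizes, and why the observed 20x slowdown points at a misconfiguration rather than copy overhead.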
 

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
...@hp.com wrote:
  Hi Evan,
 
  Thank you for explanation and useful link. I am going to build
 OpenBLAS,
  link it with Netlib-java and perform benchmark again.
 
  Do I understand correctly that BIDMat binaries contain statically
 linked
  Intel MKL BLAS? It might be the reason why I am able to run BIDMat not
  having MKL BLAS installed on my server. If it is true, I wonder if it
 is OK
  because Intel sells this library. Nevertheless, it seems that in my
 case
  precompiled MKL BLAS performs better than precompiled OpenBLAS given
 that
  BIDMat and Netlib-java are supposed to be on par with JNI overheads.
 
  Though, it might be interesting to link Netlib-java with Intel MKL, as
  you suggested. I wonder, are John Canny (BIDMat) and Sam Halliday
  (Netlib-java) interested to compare their libraries.
 
  Best regards, Alexander
 
  From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
  Sent: Friday, February 06, 2015 5:58 PM
 
  To: Ulanov, Alexander
  Cc: Joseph Bradley; dev@spark.apache.org
  Subject: Re: Using CUDA within Spark / boosting linear algebra
 
  I would build OpenBLAS yourself, since good BLAS performance comes from
  getting cache sizes, etc. set up correctly for your particular
 hardware -
  this is often a very tricky process (see, e.g. ATLAS), but we found
 that on
  relatively modern Xeon chips, OpenBLAS builds quickly and yields
  performance competitive with MKL.
 
  To make sure the right library is getting used, you have to make sure
  it's first on the search path - export
  LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
 
  For some examples of getting netlib-java setup on an ec2 node and some
  example benchmarking code we ran a while back, see:
  https://github.com/shivaram/matrix-bench
 
  In particular - build-openblas-ec2.sh shows you how to build the
 library
  and set up symlinks correctly, and scala/run-netlib.sh shows you how
 to get
  the path setup and get that library picked up by netlib-java.
 
  In this way - you could probably get cuBLAS set up to be used by
  netlib-java as well.
 
  - Evan
 

RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
I've had some email exchanges with the author of BIDMat: it does exactly
what you need to get the GPU benefit and writes higher level algorithms
entirely in the GPU kernels so that the memory stays there as long as
possible. The restriction with this approach is that it is only offering
high-level algorithms so is not a toolkit for applied mathematics
research and development --- but it works well as a toolkit for higher
level analysis (e.g. for analysts and practitioners).

I believe BIDMat's approach is the best way to get performance out of
GPU hardware at the moment but I also have strong evidence to suggest
that the hardware will catch up and the memory transfer costs between
CPU/GPU will disappear meaning that there will be no need for custom GPU
kernel implementations. i.e. please continue to use BLAS primitives when
writing new algorithms and only go to the GPU for an alternative
optimised implementation.

Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
an API that looks like BLAS but takes pointers to special regions in the
GPU memory region. Somebody has written a wrapper around CUDA to create
a proper BLAS library but it only gives marginal performance over the
CPU because of the memory transfer overhead.

This slide from my talk

  http://fommil.github.io/scalax14/#/11/2

says it all. X axis is matrix size, Y axis is logarithmic time to do
DGEMM. The black line is the "cheating" time for the GPU and the green line
is after copying the memory to/from the GPU memory. APUs have the
potential to eliminate the green line.

Best regards,
Sam


Ulanov, Alexander alexander.ula...@hp.com writes:

 Evan, thank you for the summary. I would like to add some more observations.
 The GPU that I used is 2.5 times more expensive than the CPU ($250 vs $100).
 They both are 3 years old. I also did a small test with modern hardware, and
 the new GPU, Nvidia Titan, was slightly more than one order of magnitude
 faster than an Intel E5-2650 v2 on the same tests. However, it costs as much
 as the CPU ($1200). My takeaway is that GPUs are making better
 price/performance progress.



 Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda,
 and the most reasonable explanation is that it holds the result in GPU
 memory, as Sam suggested. At the same time, this is OK because you can copy
 the result back from the GPU only when needed. To be sure, I am going to ask
 the developer of BIDMat at his upcoming talk.



 Best regards, Alexander


 From: Sam Halliday [mailto:sam.halli...@gmail.com]
 Sent: Thursday, February 26, 2015 1:56 PM
 To: Xiangrui Meng
 Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks
 Subject: Re: Using CUDA within Spark / boosting linear algebra


 Btw, I wish people would stop cheating when comparing CPU and GPU timings for 
 things like matrix multiply :-P

 Please always compare apples with apples and include the time it takes to set 
 up the matrices, send it to the processing unit, doing the calculation AND 
 copying it back to where you need to see the results.

 Ignoring this will make you believe that your GPU is thousands of
 times faster than it really is. Again, jump to the end of my talk for graphs
 and more discussion, especially the bit about me being keen on funding to
 investigate APU hardware further ;-) (I believe it will solve the problem)
 On 26 Feb 2015 21:16, Xiangrui Meng men...@gmail.com wrote:
 Hey Alexander,

 I don't quite understand the part where netlib-cublas is about 20x
 slower than netlib-openblas. What is the overhead of using a GPU BLAS
 with netlib-java?

 CC'ed Sam, the author of netlib-java.

 Best,
 Xiangrui

 On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote:
 Better documentation for linking would be very helpful!  Here's a JIRA:
 https://issues.apache.org/jira/browse/SPARK-6019


 On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com wrote:

 Thanks for compiling all the data and running these benchmarks, Alex. The
 big takeaways here can be seen with this chart:

 https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119format=interactive

 1) A properly configured GPU matrix multiply implementation (e.g.
 BIDMat+GPU) can provide substantial (but less than an order of magnitude)
 benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
 netlib-java+openblas-compiled).
 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
 than a well-tuned CPU implementation, particularly for larger matrices.
 (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this
 basically agrees with the author's own benchmarks (
 https://github.com/fommil/netlib-java)

 I think that most of our users are in a situation where using GPUs may not
 be practical - although we could consider having a good GPU