Re: GPU Acceleration of Spark Logistic Regression and Other MLlib libraries
Hi all, (I'm the author of netlib-java.) Interesting to see this discussion come to life again.

JNI is quite limiting: pinning (or critical array access) essentially disables the GC for the whole JVM for the duration of the native call. I can justify this for CPU-heavy tasks because, frankly, there are not going to be any free cycles to do anything other than BLAS while a dense matrix is being crunched. For GPU tasks, you could get into some hairy problems and hit OOM just by doing basic work. The other big problem with JNI is that the memory is either on the heap (and subject to the whims of the GC, with large pause times in tenured cleanups) or is a lightweight reference to a huge off-heap object that the GC might never clean up. There are hacks around this, but none are satisfactory. More in my talk at Scala Exchange: http://fommil.github.io/scalax14/#/

I have a roadmap to move netlib-java over to ByteBuffers, as they solve all the problems I have seen. It would be an effective rewrite (down to the Fortran-to-JVM compiler) and would change the Java API in a systematic way, but it could support BLAS-like GPU libraries at the same time. I would be willing to migrate all the major libraries that are using netlib-java as part of this effort. However, I have no commercial incentive to perform this work, so I would be seeking funding to do it. I will not be starting anything without funding. Please contact me if you would be a willing stakeholder. I estimate it as a 6-month project: all major platforms, along with a CI build making it easy to update, with testing.

On 22 Jan 2016 3:48 p.m., "Rajesh Bordawekar" <bor...@us.ibm.com> wrote:
> Hi Alexander,
>
> We, at IBM Watson Research, are also working on GPU acceleration of Spark,
> but we have taken an approach that is complementary to Ishizaki-san's
> direction. Our focus is to develop runtime infrastructure to enable
> multi-node multi-GPU exploitation in the Spark environment.
> The key goal of our approach is to enable **transparent** invocation of
> GPUs, without requiring the user to change a single line of code. Users may
> need to add a Spark configuration flag to direct the system on GPU usage
> (exact semantics are currently being debated).
>
> Currently, we have LBFGS-based Logistic Regression model building and
> prediction implemented on a multi-node multi-GPU environment (the model
> building is done on a single node). We are using our own implementation of
> LBFGS as a baseline for the GPU code. The GPU code uses cublas (I presume
> that's what you meant by NVBLAS) wherever possible, and indeed, we arrange
> the execution so that cublas operates on larger matrices. We are using JNI
> to invoke CUDA from Scala and we have not seen any performance degradation
> due to JNI-based invocation.
>
> We are in the process of implementing an ADMM-based distributed
> optimization function, which would build the model in parallel (it
> currently uses LBFGS as its individual kernel - this can be replaced by any
> other kernel as well). The ADMM function would also be accelerated in a
> multi-node multi-GPU environment. We are planning to shift to
> DataSets/DataFrames soon and support other Logistic Regression kernels
> such as Quasi-Newton based approaches.
>
> We have also enabled the Spark MLlib ALS algorithm to run on a multi-node
> multi-GPU system (the ALS code also uses cublas/cusparse). Next, we will be
> covering additional functions for GPU exploitation, e.g., word2vec (CBOW
> and Skip-gram with Negative Sampling), GloVe, etc.
>
> Regarding comparison to BIDMat/BIDMach, we have studied it in detail and
> have been using it as a guide on integrating GPU code with Scala.
> However, I think comparing end-to-end results would not be appropriate, as
> we are affected by Spark's runtime costs; specifically, a single Spark
> function to convert an RDD to arrays is very expensive and impacts our
> end-to-end performance severely (from a 200x+ gain for the GPU kernel
> alone to 25x+ through the Spark library function). In contrast, BIDMach
> has a very light and efficient layer between their GPU kernel and the user
> program.
>
> Finally, we are building a comprehensive multi-node multi-GPU resource
> management and discovery component in Spark. We are planning to augment
> the existing Spark resource management UI to include GPU resources.
>
> Please let me know if you have questions/comments! I will be attending the
> Spark Summit East, and can meet in person to discuss any details.
>
> -regards,
> Rajesh
>
> ----- Forwarded by Randy Swanberg/Austin/IBM on 01/21/2016 09:31 PM -----
>
> From: "Ulanov, Alexander" <alexander.ula...@hpe.com>
> To: Kazuaki Ishizaki <ishiz...@jp.ibm.com>, "dev@spark.apache.org"
> <dev@spark.apache.org>, Joseph Bradley <jos...@databricks.com>
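Sam's ByteBuffer roadmap above rests on the fact that direct buffers live outside the Java heap, so the GC never relocates them and a native BLAS (or GPU driver) can hold a stable pointer without pinning. A minimal stdlib-only sketch of that storage model (nothing here is netlib-java's actual API):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

public class OffHeapMatrix {
    public static void main(String[] args) {
        // A 3x3 column-major matrix stored off-heap: the backing memory is
        // allocated outside the Java heap, so the GC never moves it and
        // native code can keep a stable address to it across calls.
        ByteBuffer raw = ByteBuffer.allocateDirect(9 * Double.BYTES)
                                   .order(ByteOrder.nativeOrder());
        DoubleBuffer m = raw.asDoubleBuffer();
        for (int i = 0; i < 9; i++) m.put(i, i + 1.0); // fill with 1..9
        // Element (row 1, col 1) in column-major layout is index 1 + 1*3 = 4.
        System.out.println(m.get(4));
    }
}
```

A JNI function receiving `raw` via GetDirectBufferAddress sees the same address for the buffer's whole lifetime, which is exactly the property heap arrays lack.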
Re: Using CUDA within Spark / boosting linear algebra
I'm not at all surprised ;-) I fully expect the GPU performance to get better automatically as the hardware improves. Netlib natives still need to be shipped separately. I'd also oppose any move to make OpenBLAS the default - it's not always better, and I think natives really need DevOps buy-in. It's not the right solution for everybody.

On 26 Mar 2015 01:23, Evan R. Sparks evan.spa...@gmail.com wrote:

Yeah, much more reasonable - nice to know that we can get full GPU performance from breeze/netlib-java - meaning there's no compelling performance reason to switch out our current linear algebra library (at least as far as this benchmark is concerned). Instead, it looks like a user guide for configuring Spark/MLlib to use the right BLAS library will get us most of the way there. Or, would it make sense to finally ship openblas compiled for some common platforms (64-bit linux, windows, mac) directly with Spark - hopefully eliminating the jblas warnings once and for all for most users? (Licensing is BSD.) Or am I missing something?

On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

As everyone suggested, the results were too good to be true, so I double-checked them. It turns out that nvblas did not do the multiplication, due to the parameter NVBLAS_TILE_DIM from nvblas.conf, and returned a zero matrix. My previously posted results with nvblas reflect matrix copying only. The default NVBLAS_TILE_DIM==2048 is too big for my graphics card/matrix size. I handpicked other values that worked. As a result, netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a how-to for nvblas configuration. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

-----Original Message-----
From: Ulanov, Alexander
Sent: Wednesday, March 25, 2015 2:31 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R.
Sparks; jfcanny
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates with the original nvblas presentation at GPU conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case: these tests are not meant as a generalization of the performance of different libraries. I just want to pick the library that does dense matrix multiplication best for my task.

P.S. My previous issue with nvblas was the following: it exposes Fortran blas functions, while netlib-java uses the C cblas functions. So, one needs a cblas shared library to use nvblas through netlib-java. Fedora does not ship cblas (but Debian and Ubuntu do), so I needed to compile it. I could not use the cblas from ATLAS or OpenBLAS because they link to their own implementation and not to the Fortran blas.

Best regards, Alexander

-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi,

I am trying to use nvblas with netlib-java from Spark. The nvblas functions should replace the current blas function calls after setting LD_PRELOAD, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark.
I run the following:

export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G

In nvidia-smi I observe that Java is set up to use the GPU:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      8873    C   bash                                            39MiB |
|    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
+-----------------------------------------------------------------------------+

In the Spark shell I do a matrix multiplication and see the following:

15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so

So I am sure that netlib-native is loaded and cblas is supposedly used. However, the matrix multiplication executes on the CPU, since I see 16% of CPU used and 0% of GPU used. I also checked different matrix sizes, from 100x100 to 12000x12000. Could you suggest why LD_PRELOAD might not affect the Spark shell?
Re: Using CUDA within Spark / boosting linear algebra
Btw, OpenBLAS requires GPL runtime binaries, which are typically considered system libraries (and these fall under something similar to the Java classpath exception rule)... so it's basically impossible to distribute OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in Spark right now to clear up something of this nature.

On a more technical level, I'd recommend watching my talk at ScalaX, which explains in detail why high performance only comes from machine-optimised binaries, which requires DevOps buy-in (and I'd recommend using MKL anyway on the CPU, not OpenBLAS). On an even deeper level, using natives has consequences for JIT and GC which aren't suitable for everybody, and we'd really like people to go into that with their eyes wide open.

On 26 Mar 2015 07:43, Sam Halliday sam.halli...@gmail.com wrote:

[snip]
Re: Using CUDA within Spark / boosting linear algebra
That would be a difficult task that would only benefit users of netlib-java. MultiBLAS is easily implemented (although a lot of boilerplate) and benefits all BLAS users on the system. If anyone knows of a funding route for it, I'd love to hear from them, because it's too much work for me to take on at the moment as a hobby.

On 25 Mar 2015 22:16, Dmitriy Lyubimov dlie...@gmail.com wrote:

Sam, would it be easier to hack netlib-java to allow multiple (configurable) library contexts? And so enable 3rd-party configurations and optimizers to make their own choices until then?

On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday sam.halli...@gmail.com wrote:

Yeah, MultiBLAS... it is dynamic. Except, I haven't written it yet :-P

On 25 Mar 2015 22:06, Ulanov, Alexander alexander.ula...@hp.com wrote:

Netlib knows nothing about the GPU (or CPU); it just uses the cblas symbols from the provided libblas.so.3 library at runtime. So, you can switch at runtime by providing another library. Sam, please suggest if there is another way.

From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
Sent: Wednesday, March 25, 2015 2:55 PM
To: Ulanov, Alexander
Cc: Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: Re: Using CUDA within Spark / boosting linear algebra

Alexander, does using netlib imply that one cannot switch between CPU and GPU blas alternatives at will at the same time? The choice is always determined by linking alternatives to libblas.so, right?

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

[snip]
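The MultiBLAS idea discussed above (one process, several BLAS backends loaded side by side, with the dispatcher picking one per call) can be caricatured in a few lines. The backend names and the size threshold here are hypothetical, and a real implementation would dispatch actual dgemm calls rather than return strings:

```java
public class MultiBlasSketch {
    // One method per backend would really be a full BLAS interface;
    // a single name() keeps this sketch lambda-friendly.
    interface Blas { String name(); }

    static final Blas CPU = () -> "openblas"; // hypothetical CPU backend
    static final Blas GPU = () -> "nvblas";   // hypothetical GPU backend

    // Copy costs only amortize for big matrices, so dispatch on size.
    static final int GPU_THRESHOLD = 1000;

    static Blas pick(int n) {
        return n >= GPU_THRESHOLD ? GPU : CPU;
    }

    public static void main(String[] args) {
        System.out.println(pick(100).name());  // small matrix -> CPU backend
        System.out.println(pick(4096).name()); // large matrix -> GPU backend
    }
}
```

The point of dispatching inside the process, rather than via libblas.so.3 symlinks, is that the choice can vary per call instead of being fixed at link time.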
Re: Using CUDA within Spark / boosting linear algebra
If you write it up I'll add it to the netlib-java wiki :-) BTW, does it automatically flip between CPU/GPU? I have a project called MultiBLAS which was going to do this; it should be easy (but boring to write).

On 25 Mar 2015 22:00, Evan R. Sparks evan.spa...@gmail.com wrote:

Alex - great stuff, and the nvblas numbers are pretty remarkable (almost too good... did you check the results for correctness? - also, is it possible that the unified memory model of nvblas is somehow hiding PCI transfer time?)

This last bit (getting nvblas + netlib-java to play together) sounds like it's non-trivial and took you a while to figure out! Would you mind posting a gist or something of maybe the shell scripts/exports you used to make this work - I can imagine it being highly useful for others in the future.

Thanks!
Evan

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

[snip]
RE: Using CUDA within Spark / boosting linear algebra
Thanks so much for following up on this! Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...

On 9 Mar 2015 21:08, Ulanov, Alexander alexander.ula...@hp.com wrote:

Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see support for Double in the current source code), and did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-----Original Message-----
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks? https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community Would be nice to meet other people working on the guts of Spark! :-)

Xiangrui Meng men...@gmail.com writes:

Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote:

Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019

On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks evan.spa...@gmail.com wrote:

Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart: https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

1) A properly configured GPU matrix multiply implementation (e.g.
BIDMat+GPU) can provide a substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled).

2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java).

I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the mllib guide with a more complete section for enabling high performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically.

- Evan

On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed. It turns out that:

BIDMat-cublas > BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas

Below is the link to the spreadsheet with the full results. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine's RAM?

-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, February 10, 2015 2:12 PM
To: Evan R. Sparks
Cc: Joseph Bradley; dev@spark.apache.org
Subject: RE: Using CUDA within Spark / boosting linear algebra

Thanks, Evan! It seems that ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with the MKL from the BIDMat binaries.
Indeed, MKL is statically linked inside a 60MB library.

|A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
|100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
|1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
|10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |

It turns out that pre-compiled MKL is faster than pre-compiled OpenBLAS on my machine. Probably I'll add two more columns, with locally compiled OpenBLAS and CUDA.

Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Monday, February 09, 2015 6:06 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re
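For context on the f2jblas column above: f2j-style BLAS is plain JVM code, conceptually like the naive column-major GEMM below (minus f2j's loop optimizations). This is an illustrative baseline, not netlib-java's actual source:

```java
import java.util.Arrays;

public class NaiveGemm {
    // C += A * B for column-major n x n matrices stored in flat arrays,
    // with the k-loop hoisted so the inner loop walks a column contiguously.
    static void gemm(int n, double[] a, double[] b, double[] c) {
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                double bkj = b[k + j * n];
                for (int i = 0; i < n; i++)
                    c[i + j * n] += a[i + k * n] * bkj;
            }
    }

    public static void main(String[] args) {
        int n = 2;
        double[] a = {1, 2, 3, 4}; // column-major: [[1,3],[2,4]]
        double[] b = {5, 6, 7, 8}; // column-major: [[5,7],[6,8]]
        double[] c = new double[n * n];
        gemm(n, a, b, c);
        // [[1,3],[2,4]] * [[5,7],[6,8]] = [[23,31],[34,46]]
        System.out.println(Arrays.toString(c));
    }
}
```

The order-of-magnitude gap to MKL/OpenBLAS in the table comes from what this sketch lacks: blocking for cache, vectorization, and hand-tuned kernels per microarchitecture.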
Re: Using CUDA within Spark / boosting linear algebra
with JNI overheads. Though, it might be interesting to link netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (netlib-java) would be interested in comparing their libraries.

Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:58 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g., ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.

To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.

For some examples of getting netlib-java set up on an EC2 node, and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench In particular, build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and get that library picked up by netlib-java. In this way you could probably get cuBLAS set up to be used by netlib-java as well.

- Evan

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Evan, could you elaborate on how to force BIDMat and netlib-java to load the right blas? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I'm not sure I understand how to force the use of a specific blas (not a specific wrapper for blas). Btw,
I have installed OpenBLAS (yum install openblas), so I suppose that netlib is using it.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:19 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Getting breeze to pick up the right blas library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.

On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Hi Evan, Joseph,

I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):

|A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
|100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
|1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
|10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11. Later I will run tests with CUDA; I need to install a new CUDA version for this purpose. Do you have any ideas why breeze+netlib with native blas is so much slower than BIDMat MKL?

Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Thursday, February 05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander,

Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance.
If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking. Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Thank you for the explanation! I’ve watched the BIDMach presentation by John Canny and I am really inspired by his talk and the comparisons with Spark MLlib. I am very interested to find out what will work better within Spark: BIDMat, or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a “pure” test of linear algebra, it involves some other things
Re: Using CUDA within Spark / boosting linear algebra
Also, check the JNILoader output. Remember, for netlib-java to use your system libblas, all you need to do is set up libblas.so.3 the way any native application would expect. I haven't ever used the cuBLAS real-BLAS implementation, so I'd be interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to check that all the runtime links are in order. Btw, I have some DGEMM wrappers in my netlib-java performance module... and I also planned to write more in MultiBLAS (until I mothballed the project for the hardware to catch up, which it probably has by now; I just need a reason to look at it).

On 27 Feb 2015 20:26, Xiangrui Meng men...@gmail.com wrote:

Hey Sam, The running times are not big-O estimates: the CPU version finished in 12 seconds, the CPU-GPU-CPU version in 2.2 seconds, and the GPU version in 1.7 seconds. I think there is something wrong with the netlib/cuBLAS combination. Sam already mentioned that cuBLAS doesn't implement the CPU BLAS interfaces. I checked the CUDA doc, and it seems that to use GPU BLAS through the CPU BLAS interface we need to use NVBLAS, which intercepts some Level 3 CPU BLAS calls (including GEMM). So we need to load nvblas.so first and then some CPU BLAS library in JNI. I wonder whether the setup was correct. Alexander, could you check whether the GPU is used in the netlib-cublas experiments? You can tell by watching CPU/GPU usage. Best, Xiangrui

On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday sam.halli...@gmail.com wrote:

Don't use big-O estimates, always measure. That used to work back in the days when double multiplication was a bottleneck. The computation cost is effectively free on both the CPU and GPU and you're seeing pure copying costs. Also, I'm dubious that cuBLAS is doing what you think it is. Can you link me to the source code for DGEMM?
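A minimal sketch of the NVBLAS route Xiangrui describes, using NVIDIA's documented nvblas.conf keys. Every path here is illustrative (the CPU BLAS location in particular is an assumption, not something from the thread):

```shell
# NVBLAS intercepts Level 3 BLAS calls (GEMM etc.) and forwards the rest to
# a real CPU BLAS named in nvblas.conf. Illustrative config file:
cat > /tmp/nvblas.conf <<'EOF'
# where NVBLAS writes its log
NVBLAS_LOGFILE /tmp/nvblas.log
# required: the CPU BLAS to fall back to (path is an assumption)
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
# use every visible GPU
NVBLAS_GPU_LIST ALL
EOF
export NVBLAS_CONFIG_FILE=/tmp/nvblas.conf

# Launching the JVM would then preload NVBLAS ahead of the CPU library:
#   LD_PRELOAD=libnvblas.so java ...
# Sanity checks: 'ldd /usr/lib/libblas.so.3' for runtime links, and watch
# nvidia-smi during a DGEMM to confirm the GPU is actually used.
grep -c '^NVBLAS' /tmp/nvblas.conf
```

This matches Xiangrui's "load nvblas.so first, then some CPU BLAS" description: the interposed library handles GEMM on the GPU while everything else stays on the CPU path.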
I show all of this in my talk, with explanations. I can't stress enough how much I recommend that you watch it if you want to understand high-performance hardware acceleration for linear algebra :-)

On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:

The copying overhead should be quadratic in n, while the computation cost is cubic in n. I can understand that netlib-cublas is slower than netlib-openblas on small problems, but I'm surprised to see that it is still 20x slower on 10000x10000. I did the following on a g2.2xlarge instance with BIDMat:

    val n = 10000
    val f = rand(n, n)
    flip; f*f; val rf = flop                                                     // CPU
    flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop   // CPU-GPU-CPU
    flip; g*g; val rgg = flop                                                    // GPU

The CPU version finished in 12 seconds. The CPU-GPU-CPU version finished in 2.2 seconds. The GPU version finished in 1.7 seconds. I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas path, but based on the result, the data copying overhead is definitely not as big as 20x at n = 10000. Best, Xiangrui

On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday sam.halli...@gmail.com wrote:

I've had some email exchanges with the author of BIDMat: it does exactly what you need to get the GPU benefit, writing higher-level algorithms entirely in GPU kernels so that the memory stays there as long as possible. The restriction of this approach is that it only offers high-level algorithms, so it is not a toolkit for applied-mathematics research and development --- but it works well as a toolkit for higher-level analysis (e.g. for analysts and practitioners). I believe BIDMat's approach is the best way to get performance out of GPU hardware at the moment, but I also have strong evidence to suggest that the hardware will catch up and the memory transfer costs between CPU and GPU will disappear, meaning that there will be no need for custom GPU kernel implementations. I.e.
please continue to use BLAS primitives when writing new algorithms and only go to the GPU for an alternative optimised implementation. Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer an API that looks like BLAS but takes pointers to special regions in the GPU memory region. Somebody has written a wrapper around CUDA to create a proper BLAS library but it only gives marginal performance over the CPU because of the memory transfer overhead. This slide from my talk http://fommil.github.io/scalax14/#/11/2 says it all. X axis is matrix size, Y axis is logarithmic time to do DGEMM. Black line is the cheating time for the GPU and the green line is after copying the memory to/from the GPU memory. APUs have the potential to eliminate the green line. Best regards, Sam Ulanov, Alexander alexander.ula...@hp.com writes: Evan, thank you for the summary. I would like to add some more observations. The GPU that I used is 2.5 times cheaper than the CPU ($250 vs $100). They both are 3 years old
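Xiangrui's quadratic-versus-cubic point earlier in the thread can be put in rough numbers. This is a back-of-envelope model, not a measurement (Sam's "always measure" caveat stands): for an n x n DGEMM, data traffic grows like 3n^2 elements while arithmetic grows like 2n^3 flops, so the flops-per-byte ratio grows linearly with n and copy costs should fade for large matrices.

```shell
awk 'BEGIN {
  for (n = 100; n <= 10000; n *= 10) {
    bytes = 3 * n * n * 8      # two inputs + one result, 8-byte doubles
    flops = 2 * n * n * n      # multiply-add count for an n x n DGEMM
    printf "n=%5d  flops/byte=%.1f\n", n, flops / bytes
  }
}'
```

The ratio is n/12, so it rises from about 8 flops per byte at n = 100 to about 833 at n = 10000, which is why the 20x slowdown at the largest size pointed to a setup problem rather than pure transfer cost.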
Re: Using CUDA within Spark / boosting linear algebra
[...]

I also did a small test with modern hardware, and the new GPU, an nVidia Titan, was slightly more than one order of magnitude faster than an Intel E5-2650 v2 for the same tests. However, it costs as much as the CPU ($1200). My takeaway is that GPUs are making better price/value progress. Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda; the most reasonable explanation is that it holds the result in GPU memory, as Sam suggested. At the same time, that is OK because you can copy the result back from the GPU only when needed. To be sure, though, I am going to ask the developer of BIDMat at his upcoming talk.
Best regards, Alexander

From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Thursday, February 26, 2015 1:56 PM To: Xiangrui Meng Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks Subject: Re: Using CUDA within Spark / boosting linear algebra

Btw, I wish people would stop cheating when comparing CPU and GPU timings for things like matrix multiply :-P Please always compare apples with apples and include the time it takes to set up the matrices, send them to the processing unit, do the calculation, AND copy the results back to where you need to see them. Ignoring these costs will make you believe that your GPU is thousands of times faster than it really is. Again, jump to the end of my talk for graphs and more discussion, especially the bit about me being keen on funding to investigate APU hardware further ;-) (I believe it will solve the problem) On 26 Feb 2015 21:16, Xiangrui Meng men
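Sam's "apples with apples" rule amounts to timing every phase, not just the kernel. A tiny shell harness in that spirit — the phases here are placeholder sleeps, not the thread's real BIDMat or netlib-java workloads:

```shell
# Time each phase of a hypothetical benchmark separately so transfer costs
# are reported rather than hidden inside one aggregate number.
phase() {                        # usage: phase <label> <command...>
  local label=$1; shift
  local t0=$(date +%s%N)
  "$@"
  local t1=$(date +%s%N)
  printf '%-14s %6d ms\n' "$label" $(( (t1 - t0) / 1000000 ))
}
phase "setup"        sleep 0.05
phase "copy-to-gpu"  sleep 0.05  # the cost that GPU-only timings omit
phase "compute"      sleep 0.10
phase "copy-back"    sleep 0.05
```

Reporting the four numbers side by side makes the "cheating" visible: quoting only the compute line is the black curve on Sam's slide, while the honest total includes both copy lines.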
Re: Using CUDA within Spark / boosting linear algebra
. I am going to build OpenBLAS, link it with Netlib-java, and run the benchmark again. Do I understand correctly that the BIDMat binaries contain a statically linked Intel MKL BLAS? That might be why I am able to run BIDMat without having MKL BLAS installed on my server. If true, I wonder whether that is OK, because Intel sells this library. Nevertheless, it seems that in my case precompiled MKL BLAS performs better than precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed to be on par in JNI overheads. Though, it might be interesting to link Netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (netlib-java) would be interested in comparing their libraries. Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Friday, February 06, 2015 5:58 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g., ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL. To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here. For some examples of getting netlib-java set up on an EC2 node, and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench In particular, build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and get that library picked up by netlib-java.
In this way you could probably get cuBLAS set up to be used by netlib-java as well. - Evan

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Evan, could you elaborate on how to force BIDMat and netlib-java to load the right BLAS? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, with which I can force it to use the Java implementation. I am not sure I understand how to force the use of a specific BLAS (as opposed to a specific wrapper for BLAS). Btw, I have installed OpenBLAS (yum install openblas), so I suppose that netlib is using it.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Friday, February 06, 2015 5:19 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra

Getting breeze to pick up the right blas library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.

On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Hi Evan, Joseph, I did a few matrix multiplication tests, and BIDMat seems to be ~10x faster than netlib-java+breeze (times in seconds):

| A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
|-------------------------|-------------|-----------------------------------------------|----------------------------|
| 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
| 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
| 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |

Configuration: Intel(R) Xeon(R) CPU E3-1240, 3.3 GHz, 6 GB RAM, Fedora 19 Linux, Scala 2.11. Later I will run tests with CUDA; I need to install a new CUDA version for this purpose. Do you have any ideas why breeze+netlib with native BLAS is so much slower than BIDMat MKL?
Best regards, Alexander From: Joseph Bradley [mailto:jos...@databricks.commailto: jos...@databricks.com] Sent: Thursday, February 05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.orgmailto:dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Hi Alexander, Using GPUs with Spark would be very exciting. Small comment: Concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part
RE: Using CUDA within Spark / boosting linear algebra
[...]

On 26 Feb 2015 21:16, Xiangrui Meng men...@gmail.com wrote:

Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com wrote:

Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019

On Wed, Feb 25, 2015 at 2:53 PM, Evan R.
Sparks evan.spa...@gmail.com wrote:

Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart: https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled).

2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java).

I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU