As everyone suggested, the results were too good to be true, so I double-checked them. It turns out that nvblas did not perform the multiplication at all, because of the NVBLAS_TILE_DIM parameter in "nvblas.conf", and returned a zero matrix. My previously posted nvblas results therefore measured matrix copying only. The default NVBLAS_TILE_DIM==2048 is too big for my graphics card/matrix size. I handpicked other values that worked. As a result, netlib+nvblas is on par with BIDMat-cuda. As promised, I am going to post a how-to for nvblas configuration.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
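In the meantime, a minimal sketch of the nvblas.conf entries involved (the CPU BLAS path, GPU list, and tile size below are illustrative assumptions for one machine, not recommendations; NVBLAS_TILE_DIM has to be tuned to your GPU memory and matrix sizes):

    # nvblas.conf sketch -- values are examples, tune for your hardware
    NVBLAS_LOGFILE nvblas.log
    # CPU BLAS that nvblas falls back to for calls it does not offload
    NVBLAS_CPU_BLAS_LIB /usr/lib64/libopenblas.so
    # which GPUs to use
    NVBLAS_GPU_LIST 0
    # default is 2048, which silently failed on my card; smaller values worked
    NVBLAS_TILE_DIM 1024
    NVBLAS_AUTOPIN_MEM_ENABLED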
-----Original Message-----
From: Ulanov, Alexander
Sent: Wednesday, March 25, 2015 2:31 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBlas or MKL might be a better choice. This correlates with the original nvblas presentation at GPU Tech Conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case: these tests are not meant to generalize the performance of the different libraries. I just want to pick the library that performs dense matrix multiplication best for my task.

P.S. My previous issue with nvblas was the following: it provides Fortran blas functions, while netlib-java uses C cblas functions. So one needs a cblas shared library to use nvblas through netlib-java. Fedora does not ship cblas (Debian and Ubuntu do), so I had to compile it. I could not use the cblas from Atlas or Openblas because they link to their own implementations and not to Fortran blas.

Best regards, Alexander

-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi,

I am trying to use nvblas with netlib-java from Spark. The nvblas functions should replace the corresponding blas calls via LD_PRELOAD, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark. I run the following:

export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G

In nvidia-smi I observe that Java is set up to use the GPU:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      8873    C   bash                                            39MiB |
|    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
+-----------------------------------------------------------------------------+

In the Spark shell I do a matrix multiplication and see the following:

15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so

So I am sure that netlib-native is loaded and cblas is supposedly used. However, the matrix multiplication executes on the CPU: I see 16% CPU usage and 0% GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000. Could you suggest why LD_PRELOAD might not affect the Spark shell?

Best regards, Alexander
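The multiplication test above is essentially a Breeze GEMM in the Spark shell; a minimal sketch (the size is illustrative, and it assumes Breeze on the classpath, as in the shell) that can be watched in nvidia-smi while it runs:

    import breeze.linalg._

    // Square Double matrices; with LD_PRELOAD in effect, the underlying
    // dgemm call should be intercepted by nvblas for large enough sizes.
    val n = 4096
    val a = DenseMatrix.rand(n, n)
    val b = DenseMatrix.rand(n, n)
    val t0 = System.nanoTime
    val c = a * b // dispatches to netlib-java's BLAS backend
    println(s"gemm took ${(System.nanoTime - t0) / 1e9} s")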
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...

On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:

Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added a comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code), and ran the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-----Original Message-----
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks? https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community Would be nice to meet other people working on the guts of Spark! :-)

Xiangrui Meng <men...@gmail.com> writes:

> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best, Xiangrui
>
> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
>> Better documentation for linking would be very helpful! Here's a JIRA:
>> https://issues.apache.org/jira/browse/SPARK-6019
>>
>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>
>>> Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart:
>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>
>>> 1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide a substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled).
>>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref) can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices. This is not to pick on netlib - it basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java).
>>>
>>> I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the mllib guide with a more complete section on enabling high performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically.
>>>
>>> - Evan
>>>
>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>
>>>> Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed.
>>>> It turns out that:
>>>>
>>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>>
>>>> Below is the link to the spreadsheet with the full results:
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine's RAM?
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> To: Evan R. Sparks
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with the MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.
>>>>
>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>>> +-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>>> | 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>>>> | 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>>>> | 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>>>
>>>> (Times in seconds; comma is used as the decimal separator.)
>>>>
>>>> It turns out that pre-compiled MKL is faster than pre-compiled OpenBlas on my machine. Probably I'll add two more columns with locally compiled openblas and cuda.
>>>>
>>>> Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>>>
>>>> It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!)
>>>>
>>>> - Evan
>>>>
>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>
>>>> Hi Evan,
>>>>
>>>> Thank you for the explanation and the useful link. I am going to build OpenBLAS, link it with Netlib-java, and run the benchmark again.
>>>>
>>>> Do I understand correctly that the BIDMat binaries contain a statically linked Intel MKL BLAS? That might be the reason why I am able to run BIDMat without having MKL BLAS installed on my server. If so, I wonder whether that is OK, given that Intel sells this library. Nevertheless, it seems that in my case precompiled MKL BLAS performs better than precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed to have comparable JNI overheads.
>>>>
>>>> Though, it might be interesting to link Netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) would be interested in comparing their libraries.
>>>>
>>>> Best regards, Alexander
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Friday, February 06, 2015 5:58 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g., ATLAS) - but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.
>>>>
>>>> To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>>>>
>>>> For some examples of getting netlib-java set up on an EC2 node, and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench
>>>>
>>>> In particular, build-openblas-ec2.sh shows you how to build the library and set up the symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and have that library picked up by netlib-java.
>>>>
>>>> In this way you could probably get cuBLAS set up to be used by netlib-java as well.
>>>>
>>>> - Evan
>>>>
>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>
>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to load the right blas? For netlib, there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I am not sure I understand how to force the use of a specific blas (as opposed to a specific wrapper for blas).
>>>>
>>>> Btw, I have installed openblas (yum install openblas), so I suppose that netlib is using it.
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Friday, February 06, 2015 5:19 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Getting breeze to pick up the right blas library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.
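>>>> A quick way to check which backend netlib-java actually selected (a sketch for the Spark shell; the printed class name is typically NativeSystemBLAS when a system BLAS was found, or F2jBLAS after falling back to pure Java):
>>>>
>>>>     import com.github.fommil.netlib.BLAS
>>>>     // netlib-java resolves its implementation once, at first use
>>>>     println(BLAS.getInstance().getClass.getName)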
>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>
>>>> Hi Evan, Joseph,
>>>>
>>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):
>>>>
>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
>>>> | 100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>>>> | 1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>>>> | 10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>>>>
>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11.
>>>>
>>>> Later I will run tests with Cuda; I need to install a new Cuda version for this purpose.
>>>>
>>>> Do you have any ideas why breeze-netlib with native blas is so much slower than BIDMat MKL?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Joseph Bradley [mailto:jos...@databricks.com]
>>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Evan R. Sparks; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting. Small comment: concerning your earlier question about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this is critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and use it as another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph
>>>>
>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>
>>>> Thank you for the explanation! I've watched the BIDMach presentation by John Canny and I am really inspired by his talk and his comparisons with Spark MLlib.
>>>>
>>>> I am very interested to find out what will work better within Spark: BIDMat, or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a "pure" test of linear algebra, it involves some other things that are essential to machine learning.
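>>>> One fair-benchmarking convention (a sketch; the sizes and run count are illustrative, and it assumes Breeze on the netlib-java side) is to warm the JVM up first and then report the median of several runs, so JIT compilation and native-library loading do not pollute the first measurement:
>>>>
>>>>     import breeze.linalg._
>>>>
>>>>     def timeGemm(n: Int, runs: Int = 5): Double = {
>>>>       val a = DenseMatrix.rand(n, n)
>>>>       val b = DenseMatrix.rand(n, n)
>>>>       a * b // warm-up: triggers JIT and BLAS library loading
>>>>       val times = (1 to runs).map { _ =>
>>>>         val t0 = System.nanoTime
>>>>         a * b
>>>>         (System.nanoTime - t0) / 1e9
>>>>       }
>>>>       times.sorted.apply(runs / 2) // median, in seconds
>>>>     }
>>>>
>>>>     Seq(100, 1000, 10000).foreach(n => println(f"$n%6d: ${timeGemm(n)}%.6f s"))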
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster, it's probably due to data layout and fewer levels of indirection - it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as writing custom GPU kernels for performance-critical regions of code. [1]
>>>>
>>>> BIDMach is highly optimized for single-node performance or performance on small clusters. [2] Once data doesn't fit easily in GPU memory (or can't be batched in that way), the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different from what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 GPUs.
>>>>
>>>> In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - on the order of terabytes.
>>>>
>>>> For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python users, though.
>>>>
>>>> - Evan
>>>>
>>>> [1] - http://arxiv.org/abs/1409.5402
>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>>
>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>
>>>> Hi Evan,
>>>>
>>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java?
>>>>
>>>> The same group has the BIDMach library that implements machine learning. For some examples they use the Caffe convolutional neural network library owned by another group in Berkeley. Could you elaborate on how all of these might be connected with Spark MLlib? If you take BIDMat for linear algebra, why not take BIDMach for optimization and learning?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>> To: Ulanov, Alexander
>>>> Cc: dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in many cases.
>>>>
>>>> You might consider taking a look at the codepaths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al.
have done a bunch of work optimizing to make this work really fast from Scala. I've run it on my laptop, compared it to MKL, and in certain cases it's 10x faster at matrix multiply. There are a lot of layers of indirection here and you really want to avoid data copying as much as possible.
>>>>
>>>> We could also consider swapping Breeze out for BIDMat, but that would be a big project, and if we can figure out how to get breeze+cublas to comparable performance, that would be a big win.
>>>>
>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>
>>>> Dear Spark developers,
>>>>
>>>> I am exploring how to make linear algebra operations faster within Spark. One way of doing this is to use the Scala Breeze library that is bundled with Spark. For matrix operations, it employs Netlib-java, which has a Java wrapper for BLAS (basic linear algebra subprograms) and LAPACK native binaries if they are available on the worker node. It also has its own optimized Java implementation of BLAS. It is worth mentioning that native binaries provide better performance only for BLAS level 3, i.e. matrix-matrix operations or general matrix multiplication (GEMM). This is confirmed by the GEMM test on the Netlib-java page https://github.com/fommil/netlib-java. I also confirmed it with my experiments with training of an artificial neural network https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I would like to boost performance more.
>>>>
>>>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA implementation of BLAS called cublas. I have one Linux server with an Nvidia GPU and I was able to do the following. I linked cublas (instead of the CPU-based blas) with the Netlib-java wrapper and put it into Spark, so Breeze/Netlib is using it. Then I did some performance measurements with regard to artificial neural network batch learning in Spark MLlib, which involves matrix-matrix multiplications. It turns out that for matrices of size less than ~1000x780, GPU cublas has the same speed as CPU blas, and cublas becomes slower for bigger matrices. It is worth mentioning that this was not a test of ONLY multiplication, since other operations are involved. One of the reasons for the slowdown might be the overhead of copying the matrices from main memory to graphics card memory and back (a rough model of that overhead follows below).
>>>>
>>>> So, a few questions:
>>>> 1) Do these results with CUDA make sense?
>>>> 2) If the problem is the copy overhead, are there any libraries that allow forcing intermediate results to stay in graphics card memory, thus removing the overhead?
>>>> 3) Any other options to speed up linear algebra in Spark?
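>>>> As a rough sanity check on the copy-overhead hypothesis (the PCIe bandwidth and GPU throughput figures below are illustrative assumptions, not measurements): an n x n Double GEMM does 2*n^3 flops but moves 3*n^2 doubles over the bus, so copies dominate for small n.
>>>>
>>>>     // Illustrative figures: ~6 GB/s effective PCIe transfer, ~1e12 flop/s GPU DGEMM
>>>>     val pcieBytesPerSec = 6e9
>>>>     val gpuFlopsPerSec = 1e12
>>>>     def transferSec(n: Long) = 3.0 * n * n * 8 / pcieBytesPerSec // two inputs + one result
>>>>     def computeSec(n: Long) = 2.0 * n * n * n / gpuFlopsPerSec
>>>>     Seq(100L, 1000L, 10000L).foreach { n =>
>>>>       println(f"n=$n%6d transfer=${transferSec(n)}%.6f s compute=${computeSec(n)}%.6f s")
>>>>     }
>>>>     // n=100: transfer ~20x compute; n~1000: comparable; n=10000: compute dominates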
>>>> Thank you, Alexander
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org

--
Best regards, Sam