If you write it up I'll add it to the netlib-java wiki :-)

BTW, does it automatically flip between CPU/GPU? I have a project called
MultiBLAS which was going to do this; it should be easy (but boring to
write).
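
Roughly what I had in mind - a hypothetical sketch (all names here are
invented), dispatching on problem size, since PCI transfer overhead
dominates for small multiplies:

  import com.github.fommil.netlib.BLAS

  // Hypothetical MultiBLAS-style dispatch: small multiplies go to the
  // CPU backend, large ones to the GPU backend.
  class MultiBLAS(cpu: BLAS, gpu: BLAS, threshold: Long = 512L * 512 * 512) {
    def dgemm(transa: String, transb: String, m: Int, n: Int, k: Int,
              alpha: Double, a: Array[Double], lda: Int,
              b: Array[Double], ldb: Int, beta: Double,
              c: Array[Double], ldc: Int): Unit = {
      // pick the backend by problem size (~2*m*n*k flops)
      val impl = if (m.toLong * n * k < threshold) cpu else gpu
      impl.dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
    }
  }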
On 25 Mar 2015 22:00, "Evan R. Sparks" <evan.spa...@gmail.com> wrote:

> Alex - great stuff, and the nvblas numbers are pretty remarkable (almost
> too good... did you check the results for correctness? - also, is it
> possible that the "unified memory model" of nvblas is somehow hiding PCI
> transfer time?)
>
> This last bit (getting nvblas + netlib-java to play together) sounds like
> it was non-trivial and took you a while to figure out! Would you mind
> posting a gist or something of the shell scripts/exports you used to make
> this work? I can imagine it being highly useful for others in the future.
>
> Thanks!
> Evan
>
> On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <
> alexander.ula...@hp.com> wrote:
>
>> Hi again,
>>
>> I finally managed to use nvblas within Spark+netlib-java. It has
>> exceptional performance for big matrices with Double, faster than
>> BIDMat-cuda with Float. But for smaller matrices, if you have to copy
>> them to/from the GPU, OpenBLAS or MKL might be a better choice. This is
>> consistent with the original nvblas presentation at the GPU conference
>> in 2013 (slide 21):
>>
>> My results:
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Just in case: these tests are not meant to generalize the performance of
>> the different libraries. I just want to pick the library that performs
>> dense matrix multiplication best for my task.
>>
>> P.S. My previous issue with nvblas was the following: it exposes Fortran
>> BLAS functions, while netlib-java uses the C CBLAS interface. So one
>> needs a CBLAS shared library to use nvblas through netlib-java. Fedora
>> does not package CBLAS (Debian and Ubuntu do), so I had to compile it
>> myself. I could not use the CBLAS from ATLAS or OpenBLAS because they
>> link to their own implementations rather than to the Fortran BLAS.
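>>
>> To double-check which implementation netlib-java actually picked up, a
>> quick check from the Scala REPL helps (a sketch using netlib-java's
>> public API, not the exact code I ran):
>>
>>   import com.github.fommil.netlib.BLAS
>>   // prints e.g. ...NativeSystemBLAS if a native CBLAS was loaded,
>>   // or ...F2jBLAS if netlib-java fell back to the pure-Java version
>>   println(BLAS.getInstance().getClass.getName)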
>>
>> Best regards, Alexander
>>
>> -----Original Message-----
>> From: Ulanov, Alexander
>> Sent: Tuesday, March 24, 2015 6:57 PM
>> To: Sam Halliday
>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Hi,
>>
>> I am trying to use nvblas with netlib-java from Spark. nvblas should
>> replace the current BLAS calls after setting LD_PRELOAD, as suggested in
>> http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to
>> netlib-java. It seems to work for a simple Java example, but I cannot
>> make it work with Spark. I run the following:
>>
>> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>>
>> In nvidia-smi I observe that Java is set up to use the GPU:
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID  Type  Process name                                    Usage |
>> |=============================================================================|
>> |    0      8873    C   bash                                            39MiB |
>> |    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
>> +-----------------------------------------------------------------------------+
>>
>> In the Spark shell I do a matrix multiplication and see the following:
>>
>> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
>> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>>
>> So I am sure that netlib-native is loaded and CBLAS is supposedly used.
>> However, the matrix multiplication still executes on the CPU: I see 16%
>> CPU usage and 0% GPU usage. I also checked different matrix sizes, from
>> 100x100 to 12000x12000.
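>>
>> A minimal sketch of the kind of test I run in spark-shell (Breeze is
>> bundled with Spark; the size n is what I vary between runs):
>>
>>   import breeze.linalg.DenseMatrix
>>   val n = 4096
>>   val a = DenseMatrix.rand(n, n)
>>   val b = DenseMatrix.rand(n, n)
>>   val t0 = System.nanoTime
>>   val c = a * b  // should dispatch to netlib-java's dgemm
>>   println(s"dgemm took ${(System.nanoTime - t0) / 1e9} s")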
>>
>> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>>
>> Best regards, Alexander
>>
>>
>>
>> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>> Sent: Monday, March 09, 2015 6:01 PM
>> To: Ulanov, Alexander
>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>>
>> Thanks so much for following up on this!
>>
>> Hmm, I wonder if we should have a concerted effort to chart performance
>> on various pieces of hardware...
>> On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:
>> Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added a
>> comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see
>> support for Double in the current source code) and ran the test with
>> BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib
>> MKL.
>>
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Best regards, Alexander
>>
>> -----Original Message-----
>> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>> Sent: Tuesday, March 03, 2015 1:54 PM
>> To: Xiangrui Meng; Joseph Bradley
>> Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> BTW, is anybody on this list going to the London Meetup in a few weeks?
>>
>>
>> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>>
>> Would be nice to meet other people working on the guts of Spark! :-)
>>
>>
>> Xiangrui Meng <men...@gmail.com> writes:
>>
>> > Hey Alexander,
>> >
>> > I don't quite understand the part where netlib-cublas is about 20x
>> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
>> > with netlib-java?
>> >
>> > CC'ed Sam, the author of netlib-java.
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
>> >> Better documentation for linking would be very helpful!  Here's a JIRA:
>> >> https://issues.apache.org/jira/browse/SPARK-6019
>> >>
>> >>
>> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>> >> <evan.spa...@gmail.com> wrote:
>> >>
>> >>> Thanks for compiling all the data and running these benchmarks,
>> >>> Alex. The big takeaways here can be seen with this chart:
>> >>>
>> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>> >>>
>> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
>> >>> BIDMat+GPU) can provide a substantial (but less than an order of
>> >>> magnitude) benefit over a well-tuned CPU implementation (e.g.
>> >>> BIDMat+MKL or netlib-java+openblas-compiled).
>> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
>> >>> can be 1-2 orders of magnitude worse than a well-tuned CPU
>> >>> implementation, particularly for larger matrices. This is not to
>> >>> pick on netlib - it basically agrees with the author's own
>> >>> benchmarks (https://github.com/fommil/netlib-java).
>> >>>
>> >>> I think that most of our users are in a situation where using GPUs
>> >>> may not be practical - although we could consider having a good GPU
>> >>> backend available as an option. However, *ALL* users of MLlib could
>> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
>> >>> BLAS implementation. Perhaps we should consider updating the MLlib
>> >>> guide with a more complete section on enabling high-performance
>> >>> binaries on OS X and Linux? Or better, figure out a way for the
>> >>> system to fetch these automatically.
>> >>>
>> >>> - Evan
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>> >>> alexander.ula...@hp.com> wrote:
>> >>>
>> >>>> Just to summarize this thread, I was finally able to run all the
>> >>>> performance comparisons that we discussed. It turns out that:
>> >>>> BIDMat-cublas >> BIDMat-MKL == netlib-mkl == netlib-openblas-compiled >
>> >>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>> >>>>
>> >>>> Below is the link to the spreadsheet with full results.
>> >>>>
>> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>> >>>>
>> >>>> One thing still needs exploration: does BIDMat-cublas copy to/from
>> >>>> the machine's RAM?
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: Ulanov, Alexander
>> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
>> >>>> To: Evan R. Sparks
>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Thanks, Evan! It seems that the ticket was marked as a duplicate,
>> >>>> though the original one discusses a slightly different topic. I was
>> >>>> able to link netlib with the MKL from the BIDMat binaries. Indeed,
>> >>>> MKL is statically linked inside a 60MB library.
>> >>>>
>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>> >>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>> >>>> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>> >>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>> >>>>
>> >>>> It turns out that pre-compiled MKL is faster than pre-compiled
>> >>>> OpenBLAS on my machine. I will probably add two more columns with
>> >>>> locally compiled OpenBLAS and CUDA.
>> >>>>
>> >>>> Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> >>>> Sent: Monday, February 09, 2015 6:06 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Great - perhaps we can move this discussion off-list and onto a
>> >>>> JIRA ticket? (Here's one:
>> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
>> >>>>
>> >>>> It seems like this is going to be somewhat exploratory for a while
>> >>>> (and there's probably only a handful of us who really care about
>> >>>> fast linear
>> >>>> algebra!)
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>> >>>> alexander.ula...@hp.com> wrote:
>> >>>> Hi Evan,
>> >>>>
>> >>>> Thank you for the explanation and the useful link. I am going to
>> >>>> build OpenBLAS, link it with Netlib-java, and run the benchmark again.
>> >>>>
>> >>>> Do I understand correctly that the BIDMat binaries contain a
>> >>>> statically linked Intel MKL BLAS? That might be the reason why I am
>> >>>> able to run BIDMat without having MKL BLAS installed on my server.
>> >>>> If so, I wonder whether that is OK, given that Intel sells this
>> >>>> library. In any case, it seems that on my machine the precompiled
>> >>>> MKL BLAS performs better than the precompiled OpenBLAS, given that
>> >>>> BIDMat and Netlib-java are supposed to be on par in JNI overhead.
>> >>>>
>> >>>> Still, it might be interesting to link Netlib-java with Intel MKL,
>> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
>> >>>> Halliday (Netlib-java) would be interested in comparing their
>> >>>> libraries.
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> >>>> Sent: Friday, February 06, 2015 5:58 PM
>> >>>>
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
>> >>>> from getting cache sizes, etc. set up correctly for your particular
>> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>> >>>> quickly and yields performance competitive with MKL.
>> >>>>
>> >>>> To make sure the right library is getting used, you have to make
>> >>>> sure it's first on the search path - export
>> >>>> LD_LIBRARY_PATH=/directory/containing/your/blas/library will do the
>> >>>> trick here.
>> >>>>
>> >>>> For some examples of getting netlib-java setup on an ec2 node and
>> >>>> some example benchmarking code we ran a while back, see:
>> >>>> https://github.com/shivaram/matrix-bench
>> >>>>
>> >>>> In particular, build-openblas-ec2.sh shows you how to build the
>> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
>> >>>> shows you how to get the paths set up so that the library gets
>> >>>> picked up by netlib-java.
>> >>>>
>> >>>> In this way - you could probably get cuBLAS set up to be used by
>> >>>> netlib-java as well.
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>> >>>> alexander.ula...@hp.com> wrote:
>> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>> >>>> load the right BLAS? For netlib, there are a few JVM flags, such as
>> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>> >>>> so I can force it to use the Java implementation. I am not sure I
>> >>>> understand how to force the use of a specific BLAS (as opposed to a
>> >>>> specific wrapper for BLAS).
>> >>>>
>> >>>> Btw, I have installed OpenBLAS (yum install openblas), so I suppose
>> >>>> that netlib is using it.
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> >>>> Sent: Friday, February 06, 2015 5:19 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>> >>>>
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Getting Breeze to pick up the right BLAS library is critical for
>> >>>> performance. I recommend using OpenBLAS (or MKL, if you already
>> >>>> have it). It might make sense to force BIDMat to use the same
>> >>>> underlying BLAS library as well.
>> >>>>
>> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>> >>>> alexander.ula...@hp.com> wrote:
>> >>>> Hi Evan, Joseph
>> >>>>
>> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>> >>>> faster than netlib-java+Breeze (sorry for the weird table formatting):
>> >>>>
>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>> >>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
>> >>>> |100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>> >>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>> >>>>
>> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>> >>>> 19 Linux, Scala 2.11.
>> >>>>
>> >>>> Later I will run tests with CUDA. I need to install a new CUDA
>> >>>> version for this purpose.
>> >>>>
>> >>>> Do you have any idea why Breeze+netlib with native BLAS is so much
>> >>>> slower than BIDMat MKL?
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Joseph Bradley [mailto:jos...@databricks.com]
>> >>>> Sent: Thursday, February 05, 2015 5:29 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: Evan R. Sparks; dev@spark.apache.org
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> Hi Alexander,
>> >>>>
>> >>>> Using GPUs with Spark would be very exciting.  Small comment:
>> >>>> Concerning your question earlier about keeping data stored on the
>> >>>> GPU rather than having to move it between main memory and GPU
>> >>>> memory on each iteration, I would guess this would be critical to
>> >>>> getting good performance.  If you could do multiple local
>> >>>> iterations before aggregating results, then the cost of data
>> >>>> movement to the GPU could be amortized (and I believe that is done
>> >>>> in practice).  Having Spark be aware of the GPU and using it as
>> another part of memory sounds like a much bigger undertaking.
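>> >>>>
>> >>>> As a sketch of what "multiple local iterations" could look like
>> >>>> (the GpuOps interface here is hypothetical, purely to illustrate
>> >>>> the amortization):
>> >>>>
>> >>>>   trait GpuOps {
>> >>>>     def copyToDevice(a: Array[Double]): Long  // device handle
>> >>>>     def copyToHost(handle: Long): Array[Double]
>> >>>>     def gradientStep(data: Long, w: Long): Long
>> >>>>   }
>> >>>>   def localUpdates(gpu: GpuOps, data: Array[Double],
>> >>>>                    w: Array[Double], k: Int): Array[Double] = {
>> >>>>     val dData = gpu.copyToDevice(data)  // pay host->GPU once
>> >>>>     var dW = gpu.copyToDevice(w)
>> >>>>     for (_ <- 1 to k) dW = gpu.gradientStep(dData, dW)  // on-device
>> >>>>     gpu.copyToHost(dW)                  // pay GPU->host once
>> >>>>   }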
>> >>>>
>> >>>> Joseph
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>> >>>> alexander.ula...@hp.com> wrote:
>> >>>> Thank you for the explanation! I've watched the BIDMach
>> >>>> presentation by John Canny and I am really inspired by his talk and
>> >>>> the comparisons with Spark MLlib.
>> >>>>
>> >>>> I am very interested to find out which will be better within Spark:
>> >>>> BIDMat, or netlib-java with CPU or GPU natives. Could you suggest a
>> >>>> fair way to benchmark them? Currently I run benchmarks on artificial
>> >>>> neural networks in batch mode. While it is not a "pure" test of
>> >>>> linear algebra, it involves some other things that are essential to
>> >>>> machine learning.
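>> >>>>
>> >>>> One thing I try to control for is JIT warm-up; a minimal timing
>> >>>> harness sketch (not the exact code I use):
>> >>>>
>> >>>>   // warm up the JIT first, then report the median of several runs
>> >>>>   def bench(runs: Int = 5, warmup: Int = 3)(body: => Unit): Double = {
>> >>>>     for (_ <- 1 to warmup) body
>> >>>>     val times = (1 to runs).map { _ =>
>> >>>>       val t0 = System.nanoTime; body; (System.nanoTime - t0) / 1e9
>> >>>>     }
>> >>>>     times.sorted.apply(runs / 2)  // median, in seconds
>> >>>>   }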
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> >>>> Sent: Thursday, February 05, 2015 1:29 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: dev@spark.apache.org
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>> >>>> data layout and fewer levels of indirection - it's definitely a
>> >>>> worthwhile experiment to run. The main speedups I've seen from
>> >>>> using it come from highly optimized GPU code for linear algebra. I
>> >>>> know that in the past Canny has gone as far as to write custom GPU
>> >>>> kernels for performance-critical regions of code.[1]
>> >>>>
>> >>>> BIDMach is highly optimized for single-node performance or
>> >>>> performance on small clusters.[2] Once data doesn't fit easily in
>> >>>> GPU memory (or cannot be batched in a way that does), the
>> >>>> performance tends to fall off. Canny argues for hardware/software
>> >>>> codesign and as such prefers machine configurations that are quite
>> >>>> different from what we find in most commodity cluster nodes - e.g.
>> >>>> 10 disk channels and 4 GPUs.
>> >>>>
>> >>>> In contrast, MLlib was designed for horizontal scalability on
>> >>>> commodity clusters and works best on very big datasets - on the
>> >>>> order of terabytes.
>> >>>>
>> >>>> For the most part, these projects developed concurrently to address
>> >>>> slightly different use cases. That said, there may be bits of
>> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>> >>>> careful about maintaining cross-language compatibility for our Java
>> >>>> and Python users, though.
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>> [1] http://arxiv.org/abs/1409.5402
>> >>>> [2] http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>> >>>> alexander.ula...@hp.com> wrote:
>> >>>> Hi Evan,
>> >>>>
>> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed.
>> >>>> Do you know what makes it faster than netlib-java?
>> >>>>
>> >>>> The same group has the BIDMach library that implements machine
>> >>>> learning. For some examples they use the Caffe convolutional neural
>> >>>> network library, owned by another group in Berkeley. Could you
>> >>>> elaborate on how all of these might be connected with Spark MLlib?
>> >>>> If you take BIDMat for linear algebra, why not take BIDMach for
>> >>>> optimization and learning?
>> >>>>
>> >>>> Best regards, Alexander
>> >>>>
>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> >>>> Sent: Thursday, February 05, 2015 12:09 PM
>> >>>> To: Ulanov, Alexander
>> >>>> Cc: dev@spark.apache.org
>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>
>> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>> >>>> BLAS in many cases.
>> >>>>
>> >>>> You might consider taking a look at the codepaths that BIDMat (
>> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>> >>>> netlib-java/Breeze. John Canny et al. have done a bunch of work
>> >>>> optimizing to make this work really fast from Scala. I've run it on
>> >>>> my laptop, compared it to MKL, and in certain cases it's 10x faster
>> >>>> at matrix multiply. There are a lot of layers of indirection here
>> >>>> and you really want to avoid data copying as much as possible.
>> >>>>
>> >>>> We could also consider swapping Breeze out for BIDMat, but that
>> >>>> would be a big project, and if we can figure out how to get
>> >>>> Breeze+cublas to comparable performance, that would be a big win.
>> >>>>
>> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>> >>>> alexander.ula...@hp.com> wrote:
>> >>>> Dear Spark developers,
>> >>>>
>> >>>> I am exploring how to make linear algebra operations faster within
>> >>>> Spark. One way of doing this is to use the Scala Breeze library
>> >>>> that is bundled with Spark. For matrix operations, it employs
>> >>>> Netlib-java, which wraps native BLAS (basic linear algebra
>> >>>> subprograms) and LAPACK binaries if they are available on the
>> >>>> worker node; it also has its own optimized Java implementation of
>> >>>> BLAS. It is worth mentioning that native binaries provide better
>> >>>> performance only for BLAS level 3, i.e. matrix-matrix operations,
>> >>>> such as general matrix multiplication (GEMM). This is confirmed by
>> >>>> the GEMM test on the Netlib-java page
>> >>>> https://github.com/fommil/netlib-java. I also confirmed it in my
>> >>>> experiments with training an artificial neural network:
>> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>> >>>> However, I would like to boost performance further.
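>> >>>>
>> >>>> For reference, the level-3 call in question, made directly through
>> >>>> netlib-java (a small self-contained example; C := alpha*A*B +
>> >>>> beta*C on column-major arrays):
>> >>>>
>> >>>>   import com.github.fommil.netlib.BLAS
>> >>>>   val n = 1000
>> >>>>   val a, b = Array.fill(n * n)(math.random)  // column-major n x n
>> >>>>   val c = new Array[Double](n * n)
>> >>>>   // C := 1.0*A*B + 0.0*C - the GEMM where native backends pay off
>> >>>>   BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)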
>> >>>>
>> >>>> GPUs are supposed to be fast at linear algebra, and there is an
>> >>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
>> >>>> server with an Nvidia GPU and I was able to do the following: I
>> >>>> linked cublas (instead of a CPU-based BLAS) with the Netlib-java
>> >>>> wrapper and put it into Spark, so Breeze/Netlib uses it. Then I did
>> >>>> some performance measurements of artificial neural network batch
>> >>>> learning in Spark MLlib, which involves matrix-matrix
>> >>>> multiplications. It turns out that for matrices of size less than
>> >>>> ~1000x780, GPU cublas has the same speed as CPU BLAS, and cublas
>> >>>> becomes slower for bigger matrices. It is worth mentioning that
>> >>>> this was not a test of ONLY multiplication, since other operations
>> >>>> are involved as well. One of the reasons for the slowdown might be
>> >>>> the overhead of copying the matrices from main memory to graphics
>> >>>> card memory and back.
>> >>>>
>> >>>> So, a few questions:
>> >>>> 1) Do these results with CUDA make sense?
>> >>>> 2) If the problem is copy overhead, are there any libraries that
>> >>>> allow intermediate results to stay in graphics card memory, thus
>> >>>> removing the overhead?
>> >>>> 3) Any other options to speed up linear algebra in Spark?
>> >>>>
>> >>>> Thank you, Alexander
>> >>>>
>> >>>> ---------------------------------------------------------------------
>> >>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >>>> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>
>>
>> --
>> Best regards,
>> Sam
>>
>
>
