Sam, would it be easier to hack netlib-java to allow multiple (configurable) library contexts? That would enable third-party configurations and optimizers to make their own choices until then.
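To make the idea of configurable contexts concrete, here is a minimal sketch, assuming netlib-java's concrete classes can be instantiated directly rather than only through the global BLAS.getInstance() (Spark itself does this for F2jBLAS); the size threshold is an arbitrary placeholder:

    import com.github.fommil.netlib.{BLAS, F2jBLAS, NativeSystemBLAS}

    // Keep several BLAS implementations alive and choose per call site,
    // instead of relying on the single process-wide instance.
    // NativeSystemBLAS binds whatever libblas.so.3 resolves at runtime
    // (OpenBLAS, MKL, or an nvblas shim); F2jBLAS is the pure-JVM
    // fallback. The 100k-element cutoff is made up purely for illustration.
    object MultiBlas {
      lazy val jvm: BLAS = new F2jBLAS
      lazy val native: BLAS = new NativeSystemBLAS  // fails if no native lib is present

      def forGemm(m: Int, n: Int, k: Int): BLAS =
        if (m.toLong * n * k < 100000L) jvm else native
    }

A caller would then invoke, e.g., MultiBlas.forGemm(m, n, k).dgemm(...) instead of going through the global instance.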
On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday <sam.halli...@gmail.com> wrote:

> Yeah, MultiBLAS... it is dynamic.
>
> Except, I haven't written it yet :-P

On 25 Mar 2015 22:06, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:

> Netlib knows nothing about the GPU (or CPU): it just uses cblas symbols from the libblas.so.3 library provided at runtime. So you can switch at runtime by providing another library. Sam, please suggest if there is another way.

On Wed, Mar 25, 2015 at 2:55 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Alexander,
>
> Does using netlib imply that one cannot switch between CPU and GPU BLAS alternatives at will at runtime? The choice is always determined by linking alternatives to libblas.so, right?

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates with the original nvblas presentation at GPU conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>
> My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Just in case: these tests are not meant to generalize the performance of the different libraries. I just want to pick the library that best does dense matrix multiplication for my task.
>
> P.S. My previous issue with nvblas was the following: it exposes Fortran BLAS functions, while netlib-java uses C cblas functions. So one needs a cblas shared library to use nvblas through netlib-java. Fedora does not ship cblas (Debian and Ubuntu do), so I had to compile it. I could not use the cblas from ATLAS or OpenBLAS because they link to their own implementations and not to Fortran BLAS.
>
> Best regards, Alexander

On Tuesday, March 24, 2015 at 6:57 PM, Ulanov, Alexander wrote:

> Hi,
>
> I am trying to use nvblas with netlib-java from Spark. nvblas functions should replace the current BLAS calls after setting LD_PRELOAD, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark. I run the following:
>
>   export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>   env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>
> In nvidia-smi I observe that Java is set to use the GPU:
>
>   +-----------------------------------------------------------------------------+
>   | Processes:                                                       GPU Memory |
>   |  GPU       PID  Type  Process name                               Usage      |
>   |=============================================================================|
>   |    0      8873     C  bash                                       39MiB      |
>   |    0      8910     C  /usr/lib/jvm/java-1.7.0/bin/java           39MiB      |
>   +-----------------------------------------------------------------------------+
>
> In the Spark shell I do a matrix multiplication and see the following:
>
>   15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>
> So I am sure that netlib-native is loaded and cblas is supposedly used. However, the matrix multiplication executes on the CPU, since I see 16% CPU usage and 0% GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000.
>
> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>
> Best regards, Alexander
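One quick sanity check from the Spark shell is to ask netlib-java which implementation it actually bound; a one-liner using netlib-java's entry point BLAS.getInstance():

    // Run inside spark-shell: prints the BLAS backend netlib-java loaded.
    // "NativeSystemBLAS" means the system libblas.so.3 (and anything
    // LD_PRELOADed over it) is in play; "F2jBLAS" means the pure-Java
    // fallback, in which case LD_PRELOAD cannot have any effect.
    // (A specific implementation can also be forced with
    //  -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS.)
    println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)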
On Monday, March 09, 2015 at 6:01 PM, Sam Halliday <sam.halli...@gmail.com> wrote:

> Thanks so much for following up on this!
>
> Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...

On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:

> Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added the comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code) and ran the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Best regards, Alexander

On Tuesday, March 03, 2015 at 1:54 PM, Sam Halliday wrote:

> BTW, is anybody on this list going to the London Meetup in a few weeks? https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>
> Would be nice to meet other people working on the guts of Spark! :-)

Xiangrui Meng <men...@gmail.com> writes:

> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best, Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:

> Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019
On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

> Thanks for compiling all the data and running these benchmarks, Alex. The big takeaways here can be seen with this chart: https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>
> 1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide a substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled).
> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref) can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices. This is not to pick on netlib - it basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java).
>
> I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the mllib guide with a more complete section for enabling high performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically.
>
> - Evan

On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed. It turns out that:
>
>   BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>
> Below is the link to the spreadsheet with full results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> One thing still needs exploration: does BIDMat-cublas copy to/from the machine's RAM?

On Tuesday, February 10, 2015 at 2:12 PM, Ulanov, Alexander wrote:

> Thanks, Evan! It seems that ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with the MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.
>
>   |A*B size                 | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>   |-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------|
>   |100x100*100x100          | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>   |1000x1000*1000x1000      | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>   |10000x10000*10000x10000  | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>
> It turns out that the pre-compiled MKL is faster than the precompiled OpenBlas on my machine. Probably I'll add two more columns with locally compiled openblas and cuda.
>
> Alexander
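For anyone wanting to reproduce numbers of this shape, a minimal Breeze harness might look like the sketch below (not the benchmark actually used here; Breeze delegates DenseMatrix[Double] multiplication to whichever BLAS netlib-java bound, so the same code exercises MKL, OpenBLAS, or f2jblas depending on the environment; iteration counts are arbitrary):

    import breeze.linalg.DenseMatrix

    // Rough GEMM timing sketch: returns average seconds per multiply.
    object GemmBench {
      def timeGemm(n: Int, iters: Int): Double = {
        val a = DenseMatrix.rand(n, n)
        val b = DenseMatrix.rand(n, n)
        val start = System.nanoTime()
        var i = 0
        while (i < iters) { a * b; i += 1 }
        (System.nanoTime() - start) / 1e9 / iters
      }

      def main(args: Array[String]): Unit =
        for (n <- Seq(100, 1000, 10000))
          println(f"${n}x$n * ${n}x$n: ${timeGemm(n, if (n < 10000) 10 else 1)}%.6f s per multiply")
    }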
On Monday, February 09, 2015 at 6:06 PM, Evan R. Sparks wrote:

> Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>
> It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!)
>
> - Evan

On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Hi Evan,
>
> Thank you for the explanation and the useful link. I am going to build OpenBLAS, link it with Netlib-java, and run the benchmark again.
>
> Do I understand correctly that the BIDMat binaries contain a statically linked Intel MKL BLAS? That might be why I am able to run BIDMat without having MKL BLAS installed on my server. If so, I wonder whether that is OK, given that Intel sells this library. Nevertheless, it seems that in my case the precompiled MKL BLAS performs better than the precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed to be on par in JNI overhead.
>
> Though it might be interesting to link Netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) are interested in comparing their libraries.
>
> Best regards, Alexander

On Friday, February 06, 2015 at 5:58 PM, Evan R. Sparks wrote:

> I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g., ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.
>
> To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
>
> For some examples of getting netlib-java set up on an EC2 node and some example benchmarking code we ran a while back, see https://github.com/shivaram/matrix-bench. In particular, build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and get that library picked up by netlib-java. In this way you could probably get cuBLAS set up to be used by netlib-java as well.
>
> - Evan
On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Evan, could you elaborate on how to force BIDMat and netlib-java to load the right BLAS? For netlib there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I am not sure I understand how to force the use of a specific BLAS (rather than a specific wrapper for BLAS).
>
> Btw., I have installed openblas (yum install openblas), so I suppose netlib is using it.

On Friday, February 06, 2015 at 5:19 PM, Evan R. Sparks wrote:

> Getting Breeze to pick up the right BLAS library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.

On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander wrote:

> Hi Evan, Joseph,
>
> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):
>
>   |A*B size                 | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>   |-------------------------+-------------+-----------------------------------------------+----------------------------|
>   |100x100*100x100          | 0,00205596  | 0,03810324                                    | 0,002556                   |
>   |1000x1000*1000x1000      | 0,018320947 | 0,51803557                                    | 1,638475459                |
>   |10000x10000*10000x10000  | 23,78046632 | 445,0935211                                   | 1569,233228                |
>
> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11.
>
> Later I will run tests with CUDA; I need to install a new CUDA version for this purpose.
>
> Do you have any ideas why breeze-netlib with native BLAS is so much slower than BIDMat MKL?
>
> Best regards, Alexander

On Thursday, February 05, 2015 at 5:29 PM, Joseph Bradley <jos...@databricks.com> wrote:

> Hi Alexander,
>
> Using GPUs with Spark would be very exciting. Small comment: concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and using it as another part of memory sounds like a much bigger undertaking.
>
> Joseph
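A sketch of the "multiple local iterations before aggregating" pattern Joseph describes, in plain Spark; the update function and the naive averaging are stand-ins, and in a GPU setting the one-time materialization per partition is where a single host-to-device upload would be amortized:

    import org.apache.spark.rdd.RDD

    // Run several local passes over each partition, then aggregate once,
    // so any per-partition transfer cost (e.g. a host->GPU copy) is paid
    // once rather than once per iteration.
    def localIterations(
        data: RDD[Array[Double]],
        init: Array[Double],
        localIters: Int,
        step: (Array[Double], Array[Double]) => Array[Double]): Array[Double] = {
      data.mapPartitions { part =>
        val rows = part.toArray            // one "upload" per partition
        var model = init.clone()
        for (_ <- 1 to localIters; row <- rows)
          model = step(model, row)         // local iterations, no transfer
        Iterator.single(model)
      }.reduce((a, b) => a.zip(b).map { case (x, y) => (x + y) / 2 })  // naive model averaging
    }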
On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Thank you for the explanation! I've watched the BIDMach presentation by John Canny and I am really inspired by his talk and the comparisons with Spark MLlib.
>
> I am very interested to find out what will work better within Spark: BIDMat, or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I run benchmarks on artificial neural networks in batch mode. While that is not a "pure" test of linear algebra, it involves some other things that are essential to machine learning.
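On the fairness question, whatever libraries end up being compared, a JVM-side harness needs JIT warm-up and multiple measured runs before the numbers mean anything; a minimal sketch of such a wrapper:

    // Generic measurement helper: run `body` a few times untimed to let
    // the JIT settle, then report the median of the timed runs (seconds).
    def measure[A](warmups: Int, runs: Int)(body: => A): Double = {
      for (_ <- 1 to warmups) body
      val times = for (_ <- 1 to runs) yield {
        val t0 = System.nanoTime()
        body
        (System.nanoTime() - t0) / 1e9
      }
      times.sorted.apply(runs / 2)
    }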
On Thursday, February 05, 2015 at 1:29 PM, Evan R. Sparks wrote:

> I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection - it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as to write custom GPU kernels for performance-critical regions of code. [1]
>
> BIDMach is highly optimized for single-node performance or performance on small clusters. [2] Once data doesn't fit easily in GPU memory (or can't be batched in that way), the performance tends to fall off. Canny argues for hardware/software codesign and as such prefers machine configurations that are quite different from what we find in most commodity cluster nodes - e.g., 10 disk channels and 4 GPUs.
>
> In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - on the order of terabytes.
>
> For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python users, though.
>
> - Evan
>
> [1] - http://arxiv.org/abs/1409.5402
> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Hi Evan,
>
> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java?
>
> The same group has the BIDMach library that implements machine learning. For some examples they use the Caffe convolutional neural network library, owned by another group in Berkeley. Could you elaborate on how these all might be connected with Spark MLlib? If you take BIDMat for linear algebra, why not take BIDMach for optimization and learning?
>
> Best regards, Alexander

On Thursday, February 05, 2015 at 12:09 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

> I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many cases.
>
> You might consider taking a look at the codepaths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al. have done a bunch of work optimizing to make this work really fast from Scala. I've run it on my laptop and compared to MKL, and in certain cases it's 10x faster at matrix multiply. There are a lot of layers of indirection here and you really want to avoid data copying as much as possible.
>
> We could also consider swapping out Breeze for BIDMat, but that would be a big project, and if we can figure out how to get breeze+cublas to comparable performance, that would be a big win.
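For reference, the call all of those layers eventually funnel into is BLAS level-3 dgemm (C := alpha*A*B + beta*C, over flat column-major arrays); a minimal direct netlib-java invocation looks like this sketch:

    import com.github.fommil.netlib.BLAS

    // The level-3 routine Breeze's DenseMatrix multiply ultimately
    // delegates to; "N" means "do not transpose".
    object DgemmExample {
      def main(args: Array[String]): Unit = {
        val n = 1000
        val a = Array.fill(n * n)(math.random)
        val b = Array.fill(n * n)(math.random)
        val c = new Array[Double](n * n)
        // C := 1.0 * A * B + 0.0 * C
        BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
        println(s"c(0,0) = ${c(0)}")
      }
    }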
It >> >>>> is worth mentioning, that native binaries provide better performance >> only for BLAS level 3, i.e. >> >>>> matrix-matrix operations or general matrix multiplication (GEMM). >> >>>> This is confirmed by GEMM test on Netlib-java page >> >>>> https://github.com/fommil/netlib-java. I also confirmed it with my >> >>>> experiments with training of artificial neural network >> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952. >> >>>> However, I would like to boost performance more. >> >>>> >> >>>> GPU is supposed to work fast with linear algebra and there is >> >>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux >> >>>> server with Nvidia GPU and I was able to do the following. I linked >> >>>> cublas (instead of cpu-based blas) with Netlib-java wrapper and put >> >>>> it into Spark, so Breeze/Netlib is using it. Then I did some >> >>>> performance measurements with regards to artificial neural network >> >>>> batch learning in Spark MLlib that involves matrix-matrix >> >>>> multiplications. It turns out that for matrices of size less than >> >>>> ~1000x780 GPU cublas has the same speed as CPU blas. Cublas becomes >> >>>> slower for bigger matrices. It worth mentioning that it is was not a >> test for ONLY multiplication since there are other operations involved. >> >>>> One of the reasons for slowdown might be the overhead of copying >> >>>> the matrices from computer memory to graphic card memory and back. >> >>>> >> >>>> So, few questions: >> >>>> 1) Do these results with CUDA make sense? >> >>>> 2) If the problem is with copy overhead, are there any libraries >> >>>> that allow to force intermediate results to stay in graphic card >> >>>> memory thus removing the overhead? >> >>>> 3) Any other options to speed-up linear algebra in Spark? >> >>>> >> >>>> Thank you, Alexander >> >>>> >> >>>> ------------------------------------------------------------------- >> >>>> -- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org<mailto: >> dev-unsubscr...@spark.apache.org><mailto: >> >>>> dev-unsubscr...@spark.apache.org<mailto:dev-unsubscribe@spark.apach >> >>>> e.org>><mailto:dev-unsubscr...@spark.apac<mailto:dev-unsubscribe@sp >> >>>> ark.apac> he.org<http://he.org> >> >>>> <mailto:dev-unsubscr...@spark.apache.org<mailto:dev-unsubscribe@spa >> >>>> rk.apache.org>>> For additional commands, e-mail: >> >>>> dev-h...@spark.apache.org<mailto:dev-h...@spark.apache.org><mailto: >> >>>> dev-h...@spark.apache.org<mailto:dev-h...@spark.apache.org>><mailto: >> dev-h...@spark.apache.org<mailto:dev-h...@spark.apache.org><mailto: >> >>>> dev-h...@spark.apache.org<mailto:dev-h...@spark.apache.org>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>> >> >> -- >> Best regards, >> Sam >> >> >> >