That would be a difficult task that would only benefit users of netlib-java. MultiBLAS is easily implemented (although a lot of boilerplate) and benefits all BLAS users on the system.
If anyone knows of a funding route for it, I'd love to hear from them, because it's too much work for me to take on at the moment as a hobby.

On 25 Mar 2015 22:16, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:

Sam,

would it be easier to hack netlib-java to allow multiple (configurable) library contexts? And so enable 3rd-party configurations and optimizers to make their own choices until then?

On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday <sam.halli...@gmail.com> wrote:

Yeah, MultiBLAS... it is dynamic.

Except, I haven't written it yet :-P

On 25 Mar 2015 22:06, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:

Netlib knows nothing about the GPU (or CPU); it just uses the cblas symbols from the provided libblas.so.3 library at runtime. So you can switch at runtime by providing another library. Sam, please suggest if there is another way.

From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
Sent: Wednesday, March 25, 2015 2:55 PM
To: Ulanov, Alexander
Cc: Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: Re: Using CUDA within Spark / boosting linear algebra

Alexander,

does using netlib imply that one cannot switch between CPU and GPU BLAS alternatives at will at the same time? The choice is always determined by linking alternatives to libblas.so, right?

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Hi again,

I finally managed to use nvblas within Spark+netlib-java. It has exceptional performance for big matrices with Double, faster than BIDMat-cuda with Float. But for smaller matrices, if you copy them to/from the GPU, OpenBLAS or MKL might be a better choice.
This correlates with the original nvblas presentation from GPU conf 2013 (slide 21): http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

My results: https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Just in case: these tests are not meant as a generalization of the performance of the different libraries. I just want to pick the library that does dense matrix multiplication best for my task.

P.S. My previous issue with nvblas was the following: it provides the Fortran BLAS functions, while netlib-java uses the C cblas functions. So one needs a cblas shared library to use nvblas through netlib-java. Fedora does not have cblas (but Debian and Ubuntu have it), so I needed to compile it. I could not use the cblas from ATLAS or OpenBLAS because they link to their own implementation and not to Fortran BLAS.

Best regards, Alexander

-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Hi,

I am trying to use nvblas with netlib-java from Spark. The nvblas functions should replace the current BLAS function calls after setting LD_PRELOAD, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage, without any changes to netlib-java. It seems to work for a simple Java example, but I cannot make it work with Spark.
I run the following:

export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G

In nvidia-smi I observe that Java is set to use the GPU:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      8873    C   bash                                            39MiB |
|    0      8910    C   /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
+-----------------------------------------------------------------------------+

In the Spark shell I do a matrix multiplication and see the following:

15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so

So I am sure that netlib-native is loaded and cblas is supposedly used. However, the matrix multiplication executes on the CPU, since I see 16% CPU usage and 0% GPU usage. I also checked different matrix sizes, from 100x100 to 12000x12000.

Could you suggest why LD_PRELOAD might not affect the Spark shell?

Best regards, Alexander

From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra

Thanks so much for following up on this!

Hmm, I wonder if we should have a concerted effort to chart performance on various pieces of hardware...

On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:

Hi Everyone, I've updated the benchmark as Xiangrui suggested.
Added the comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see support for Double in the current source code), and did the test with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.

https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

Best regards, Alexander

-----Original Message-----
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

BTW, is anybody on this list going to the London Meetup in a few weeks?

https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community

Would be nice to meet other people working on the guts of Spark! :-)

Xiangrui Meng <men...@gmail.com> writes:

Hey Alexander,

I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java?

CC'ed Sam, the author of netlib-java.

Best, Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:

Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019

On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Thanks for compiling all the data and running these benchmarks, Alex.
The big takeaways here can be seen with this chart:

https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

1) A properly configured GPU matrix multiply implementation (e.g. BIDMat+GPU) can provide a substantial (but less than an order of magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or netlib-java+openblas-compiled).

2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse than a well-tuned CPU implementation, particularly for larger matrices (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this basically agrees with the author's own benchmarks (https://github.com/fommil/netlib-java).

I think that most of our users are in a situation where using GPUs may not be practical - although we could consider having a good GPU backend available as an option. However, *ALL* users of MLlib could benefit (potentially tremendously) from using a well-tuned CPU-based BLAS implementation. Perhaps we should consider updating the MLlib guide with a more complete section on enabling high-performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically.

- Evan

On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Just to summarize this thread, I was finally able to make all the performance comparisons that we discussed. It turns out that:

BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas

Below is the link to the spreadsheet with full results.
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing

One thing still needs exploration: does BIDMat-cublas perform copying to/from the machine's RAM?

-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, February 10, 2015 2:12 PM
To: Evan R. Sparks
Cc: Joseph Bradley; dev@spark.apache.org
Subject: RE: Using CUDA within Spark / boosting linear algebra

Thanks, Evan! It seems that ticket was marked as a duplicate, though the original one discusses a slightly different topic. I was able to link netlib with the MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a 60MB library.

+-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
| A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
+-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
| 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
| 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
| 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
+-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+

It turns out that the pre-compiled MKL is faster than the pre-compiled OpenBlas on my machine. Probably I'll add two more columns with locally compiled OpenBlas and CUDA.

Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Monday, February 09, 2015 6:06 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)

It seems like this is going to be somewhat exploratory for a while (and there's probably only a handful of us who really care about fast linear algebra!)

- Evan

On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Hi Evan,

Thank you for the explanation and the useful link. I am going to build OpenBLAS, link it with netlib-java, and run the benchmark again.

Do I understand correctly that the BIDMat binaries contain a statically linked Intel MKL BLAS? That might be the reason why I am able to run BIDMat without having MKL BLAS installed on my server. If so, I wonder if that is OK, because Intel sells this library. Nevertheless, it seems that in my case the precompiled MKL BLAS performs better than the precompiled OpenBLAS, given that BIDMat and netlib-java are supposed to be on par with respect to JNI overheads.

Though it might be interesting to link netlib-java with Intel MKL, as you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (netlib-java) are interested in comparing their libraries.

Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:58 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky process (see, e.g., ATLAS), but we found that on relatively modern Xeon chips, OpenBLAS builds quickly and yields performance competitive with MKL.

To make sure the right library is getting used, you have to make sure it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.

For some examples of getting netlib-java set up on an ec2 node and some example benchmarking code we ran a while back, see: https://github.com/shivaram/matrix-bench

In particular - build-openblas-ec2.sh shows you how to build the library and set up symlinks correctly, and scala/run-netlib.sh shows you how to get the path set up and get that library picked up by netlib-java.

In this way - you could probably get cuBLAS set up to be used by netlib-java as well.

- Evan

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Evan, could you elaborate on how to force BIDMat and netlib-java to load the right BLAS?
For netlib there are a few JVM flags, such as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can force it to use the Java implementation. I am not sure I understand how to force the use of a specific BLAS (as opposed to a specific wrapper for BLAS).

Btw, I have installed openblas (yum install openblas), so I suppose that netlib is using it.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:19 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Getting breeze to pick up the right BLAS library is critical for performance. I recommend using OpenBLAS (or MKL, if you already have it). It might make sense to force BIDMat to use the same underlying BLAS library as well.
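As a side note for readers of the archive: the netlib-java system property quoted above can be passed straight to the Spark driver JVM, and the other bundled implementation class names follow the same pattern. A sketch, assuming netlib-java 1.x class names and a Spark 1.x spark-shell (the exact option name may vary by Spark version, so check spark-shell --help):

```shell
# Force the pure-Java fallback (useful as a baseline for comparisons):
spark-shell --driver-java-options \
  "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS"

# Or pin the native system backend, which binds to whatever library
# provides the cblas symbols (OpenBLAS, MKL, nvblas via LD_PRELOAD, ...):
spark-shell --driver-java-options \
  "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS"
```

Which concrete library backs NativeSystemBLAS is then decided by the dynamic loader (LD_LIBRARY_PATH / LD_PRELOAD), as discussed elsewhere in this thread; netlib-java itself only selects the wrapper.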
On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Hi Evan, Joseph,

I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than netlib-java+breeze (sorry for the weird table formatting):

+-------------------------+-------------+-----------------------------------------------+----------------------------+
| A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
+-------------------------+-------------+-----------------------------------------------+----------------------------+
| 100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
| 1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
| 10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
+-------------------------+-------------+-----------------------------------------------+----------------------------+

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, Scala 2.11.

Later I will run tests with CUDA; I need to install a new CUDA version for this purpose.

Do you have any idea why breeze-netlib with native BLAS is so much slower than BIDMat MKL?

Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Thursday, February 05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander,

Using GPUs with Spark would be very exciting.
Small comment: Concerning your question earlier about keeping data stored on the GPU rather than having to move it between main memory and GPU memory on each iteration, I would guess this would be critical to getting good performance. If you could do multiple local iterations before aggregating results, then the cost of data movement to the GPU could be amortized (and I believe that is done in practice). Having Spark be aware of the GPU and use it as another part of memory sounds like a much bigger undertaking.

Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Thank you for the explanation! I've watched the BIDMach presentation by John Canny and I am really inspired by his talk and the comparisons with Spark MLlib.

I am very interested to find out which will be better within Spark: BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark them? Currently I do benchmarks on artificial neural networks in batch mode. While it is not a "pure" test of linear algebra, it involves some other things that are essential to machine learning.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 1:29 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection - it's definitely a worthwhile experiment to run. The main speedups I've seen from using it come from highly optimized GPU code for linear algebra. I know that in the past Canny has gone as far as writing custom GPU kernels for performance-critical regions of code. [1]

BIDMach is highly optimized for single-node performance, or performance on small clusters. [2] Once data doesn't fit easily in GPU memory (or cannot be batched that way), performance tends to fall off. Canny argues for hardware/software codesign, and as such prefers machine configurations that are quite different from what we find in most commodity cluster nodes - e.g. 10 disk channels and 4 GPUs.

In contrast, MLlib was designed for horizontal scalability on commodity clusters and works best on very big datasets - order of terabytes.

For the most part, these projects developed concurrently to address slightly different use cases. That said, there may be bits of BIDMach we could repurpose for MLlib - keep in mind we need to be careful about maintaining cross-language compatibility for our Java and Python users, though.
- Evan

[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Hi Evan,

Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know what makes it faster than netlib-java?

The same group has the BIDMach library that implements machine learning. For some examples they use the Caffe convolutional neural network library, owned by another group in Berkeley. Could you elaborate on how all of these might be connected with Spark MLlib? If you take BIDMat for linear algebra, why not take BIDMach for optimization and learning?

Best regards, Alexander

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 12:09 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many cases.
You might consider taking a look at the codepaths that BIDMat (https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al. have done a bunch of optimization work to make this really fast from Scala. I've run it on my laptop, compared it to MKL, and in certain cases it's 10x faster at matrix multiply. There are a lot of layers of indirection here, and you really want to avoid data copying as much as possible.

We could also consider swapping in BIDMat for Breeze, but that would be a big project, and if we can figure out how to get breeze+cublas to comparable performance, that would be a big win.

On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Dear Spark developers,

I am exploring how to make linear algebra operations faster within Spark. One way of doing this is to use the Scala Breeze library that is bundled with Spark. For matrix operations, it employs netlib-java, which has a Java wrapper for BLAS (basic linear algebra subprograms) and LAPACK native binaries if they are available on the worker node. It also has its own optimized Java implementation of BLAS. It is worth mentioning that native binaries provide better performance only for BLAS level 3, i.e. matrix-matrix operations or general matrix multiplication (GEMM). This is confirmed by the GEMM test on the netlib-java page https://github.com/fommil/netlib-java.
I also confirmed it with my experiments with training an artificial neural network: https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I would like to boost performance more.

GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA implementation of BLAS, called cublas. I have one Linux server with an Nvidia GPU and I was able to do the following. I linked cublas (instead of a CPU-based BLAS) with the netlib-java wrapper and put it into Spark, so Breeze/netlib is using it. Then I did some performance measurements with regard to artificial neural network batch learning in Spark MLlib, which involves matrix-matrix multiplications. It turns out that for matrices of size less than ~1000x780, GPU cublas has the same speed as CPU BLAS. Cublas becomes slower for bigger matrices. It is worth mentioning that it was not a test of ONLY multiplication, since there are other operations involved. One of the reasons for the slowdown might be the overhead of copying the matrices from main memory to graphics card memory and back.

So, a few questions:
1) Do these results with CUDA make sense?
2) If the problem is the copy overhead, are there any libraries that allow intermediate results to be kept in graphics card memory, thus removing the overhead?
3) Any other options to speed up linear algebra in Spark?
Thank you, Alexander

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

--
Best regards,
Sam
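The timings traded back and forth in this thread all follow the same measurement pattern: generate matrices, warm up, time repeated multiplies, and report the best run. A minimal, stdlib-only Java sketch of such a harness follows; the naive triple-loop gemm is a hypothetical stand-in for whichever BLAS call (netlib-java, BIDMat, cublas) is actually under test, and the sizes and run counts are illustrative only:

```java
import java.util.Random;

public class GemmBench {
    // Naive dgemm stand-in: C += A * B for n x n row-major matrices.
    // In a real benchmark, replace this call with the library under test.
    static void gemm(int n, double[] a, double[] b, double[] c) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++)
                    c[i * n + j] += aik * b[k * n + j];
            }
    }

    public static void main(String[] args) {
        int n = 256, warmup = 3, runs = 5;  // illustrative sizes
        Random rnd = new Random(42);
        double[] a = new double[n * n], b = new double[n * n];
        for (int i = 0; i < n * n; i++) {
            a[i] = rnd.nextDouble();
            b[i] = rnd.nextDouble();
        }

        // Warm-up runs so the JIT compiles the hot loop before timing.
        for (int w = 0; w < warmup; w++) gemm(n, a, b, new double[n * n]);

        // Time several runs and keep the best, to reduce scheduling noise.
        double best = Double.MAX_VALUE;
        for (int r = 0; r < runs; r++) {
            double[] c = new double[n * n];
            long t0 = System.nanoTime();
            gemm(n, a, b, c);
            best = Math.min(best, (System.nanoTime() - t0) / 1e9);
        }
        System.out.printf("%dx%d gemm: best of %d runs = %.6f s%n", n, n, runs, best);
    }
}
```

Warm-up matters particularly for the f2jblas numbers above, since the pure-Java path is exactly the kind of code whose first iterations run interpreted.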