Alright Sam - you are the expert here. If the GPL issues are unavoidable, that's fine - what is the exact bit of code that is GPL?
The suggestion to use OpenBLAS is not to say it's the best option, but that it's a *free, reasonable default* for many users - keep in mind the most common deployment for Spark/MLlib is on 64-bit Linux on EC2 [1]. Additionally, for many of the problems we're targeting, this reasonable default can provide a 1-2 order-of-magnitude improvement in performance over the f2jblas implementation that netlib-java falls back on. The JVM issues are trickier, I agree - so it sounds like a good user guide explaining the tradeoffs and configuration procedures as they relate to Spark is a reasonable way forward.

[1] - https://gigaom.com/2015/01/27/a-few-interesting-numbers-about-apache-spark/

On Thu, Mar 26, 2015 at 12:54 AM, Sam Halliday <sam.halli...@gmail.com> wrote:

> Btw, OpenBLAS requires GPL runtime binaries which are typically considered
> "system libraries" (and these fall under something similar to the Java
> classpath exception rule)... so it's basically impossible to distribute
> OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in
> Spark right now to clear up something of this nature.
>
> On a more technical level, I'd recommend watching my talk at ScalaX, which
> explains in detail why high performance only comes from machine-optimised
> binaries, which requires DevOps buy-in (and I'd recommend using MKL on the
> CPU anyway, not OpenBLAS).
>
> On an even deeper level, using natives has consequences for the JIT and GC
> which aren't suitable for everybody, and we'd really like people to go
> into that with their eyes wide open.
>
> On 26 Mar 2015 07:43, "Sam Halliday" <sam.halli...@gmail.com> wrote:
>
>> I'm not at all surprised ;-) I fully expect the GPU performance to get
>> better automatically as the hardware improves.
>>
>> Netlib natives still need to be shipped separately. I'd also oppose any
>> move to make OpenBLAS the default - it's not always better, and I think
>> natives really need DevOps buy-in. It's not the right solution for
>> everybody.
>>
>> On 26 Mar 2015 01:23, "Evan R. Sparks" <evan.spa...@gmail.com> wrote:
>>
>>> Yeah, much more reasonable - nice to know that we can get full GPU
>>> performance from breeze/netlib-java - meaning there's no compelling
>>> performance reason to switch out our current linear algebra library (at
>>> least as far as this benchmark is concerned).
>>>
>>> Instead, it looks like a user guide for configuring Spark/MLlib to use
>>> the right BLAS library will get us most of the way there. Or, would it
>>> make sense to finally ship OpenBLAS compiled for some common platforms
>>> (64-bit Linux, Windows, Mac) directly with Spark - hopefully eliminating
>>> the jblas warnings once and for all for most users? (The licensing is
>>> BSD.) Or am I missing something?
>>>
>>> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>
>>>> As everyone suggested, the results were too good to be true, so I
>>>> double-checked them. It turns out that nvblas did not do the
>>>> multiplication at all, due to the NVBLAS_TILE_DIM parameter in
>>>> "nvblas.conf", and returned a zero matrix. My previously posted results
>>>> with nvblas measured matrix copying only. The default
>>>> NVBLAS_TILE_DIM==2048 is too big for my graphics card/matrix size, so I
>>>> handpicked other values that worked. As a result, netlib+nvblas is on
>>>> par with BIDMat-cuda. As promised, I am going to post a how-to for
>>>> nvblas configuration.
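>>>>
>>>> A minimal sanity check against this failure mode (a sketch using
>>>> Breeze, which Spark already bundles; the size is illustrative):
>>>> multiply two random matrices and verify the product is non-zero before
>>>> trusting any timings.
>>>>
>>>>   import breeze.linalg.{DenseMatrix, sum}
>>>>
>>>>   // A misconfigured nvblas (e.g. an oversized NVBLAS_TILE_DIM) can
>>>>   // silently return zeros, so a "benchmark" ends up timing only copies.
>>>>   val a = DenseMatrix.rand(1024, 1024)  // DenseMatrix[Double]
>>>>   val b = DenseMatrix.rand(1024, 1024)
>>>>   val c = a * b  // dispatched through netlib-java to whichever BLAS is loaded
>>>>   require(sum(c) != 0.0, "BLAS returned a zero matrix - check nvblas.conf")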
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Wednesday, March 25, 2015 2:31 PM
>>>> To: Sam Halliday
>>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi again,
>>>>
>>>> I finally managed to use nvblas within Spark+netlib-java. It has
>>>> exceptional performance for big matrices with Double, faster than
>>>> BIDMat-cuda with Float. But for smaller matrices, if you copy them
>>>> to/from the GPU, OpenBLAS or MKL might be a better choice. This
>>>> correlates with the original nvblas presentation from GPU conf 2013
>>>> (slide 21):
>>>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>>>
>>>> My results:
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> Just in case: these tests are not meant to generalize the performance
>>>> of the different libraries. I just want to pick the library that
>>>> performs dense matrix multiplication best for my task.
>>>>
>>>> P.S. My previous issue with nvblas was the following: it provides
>>>> Fortran blas functions, while netlib-java uses C cblas functions. So
>>>> one needs a cblas shared library to use nvblas through netlib-java.
>>>> Fedora does not ship cblas (Debian and Ubuntu do), so I needed to
>>>> compile it. I could not use the cblas from ATLAS or OpenBLAS because
>>>> they link to their own implementations and not to the Fortran blas.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, March 24, 2015 6:57 PM
>>>> To: Sam Halliday
>>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi,
>>>>
>>>> I am trying to use nvblas with netlib-java from Spark. The nvblas
>>>> functions should replace the current blas function calls after setting
>>>> LD_PRELOAD, as suggested in http://docs.nvidia.com/cuda/nvblas/#Usage,
>>>> without any changes to netlib-java. It seems to work for a simple Java
>>>> example, but I cannot make it work with Spark. I run the following:
>>>>
>>>>   export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>>>>   env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
>>>>
>>>> In nvidia-smi I observe that Java is set up to use the GPU:
>>>>
>>>> +-----------------------------------------------------------------------------+
>>>> | Processes:                                                       GPU Memory |
>>>> |  GPU       PID  Type  Process name                               Usage      |
>>>> |=============================================================================|
>>>> |    0      8873     C  bash                                       39MiB      |
>>>> |    0      8910     C  /usr/lib/jvm/java-1.7.0/bin/java           39MiB      |
>>>> +-----------------------------------------------------------------------------+
>>>>
>>>> In the Spark shell I do a matrix multiplication and see the following:
>>>>
>>>>   15/03/25 06:48:01 INFO JniLoader: successfully loaded /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>>>>
>>>> So I am sure that netlib-native is loaded and cblas is supposedly
>>>> used. However, the matrix multiplication executes on the CPU, since I
>>>> see 16% CPU usage and 0% GPU usage. I also checked different matrix
>>>> sizes, from 100x100 to 12000x12000.
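>>>>
>>>> The multiplication I run in the shell is essentially the following (a
>>>> sketch via Breeze; the size is illustrative - large enough that a GPU
>>>> gemm should be clearly visible in nvidia-smi while it runs):
>>>>
>>>>   import breeze.linalg.DenseMatrix
>>>>
>>>>   val a = DenseMatrix.rand(8192, 8192)  // DenseMatrix[Double]
>>>>   val b = DenseMatrix.rand(8192, 8192)
>>>>   val c = a * b  // ~1.1e12 flops of work; watch nvidia-smi meanwhile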
>>>>
>>>> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>>>> Sent: Monday, March 09, 2015 6:01 PM
>>>> To: Ulanov, Alexander
>>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Thanks so much for following up on this!
>>>>
>>>> Hmm, I wonder if we should have a concerted effort to chart performance
>>>> on various pieces of hardware...
>>>>
>>>> On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:
>>>>
>>>> Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added
>>>> the comment that BIDMat 0.9.7 uses Float matrices on the GPU (although
>>>> I see support for Double in the current source code), and did the test
>>>> with BIDMat and CPU Double matrices. BIDMat MKL is indeed on par with
>>>> netlib MKL.
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> Best regards, Alexander
>>>>
>>>> -----Original Message-----
>>>> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>>>> Sent: Tuesday, March 03, 2015 1:54 PM
>>>> To: Xiangrui Meng; Joseph Bradley
>>>> Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> BTW, is anybody on this list going to the London Meetup in a few weeks?
>>>>
>>>> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>>>>
>>>> Would be nice to meet other people working on the guts of Spark! :-)
>>>>
>>>> Xiangrui Meng <men...@gmail.com> writes:
>>>>
>>>> > Hey Alexander,
>>>> >
>>>> > I don't quite understand the part where netlib-cublas is about 20x
>>>> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
>>>> > with netlib-java?
>>>> >
>>>> > CC'ed Sam, the author of netlib-java.
>>>> >
>>>> > Best,
>>>> > Xiangrui
>>>> >
>>>> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>>> >> Better documentation for linking would be very helpful! Here's a JIRA:
>>>> >> https://issues.apache.org/jira/browse/SPARK-6019
>>>> >>
>>>> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>>> >>
>>>> >>> Thanks for compiling all the data and running these benchmarks,
>>>> >>> Alex. The big takeaways here can be seen with this chart:
>>>> >>>
>>>> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>> >>>
>>>> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>>> >>> BIDMat+GPU) can provide a substantial (but less than an order of
>>>> >>> magnitude) benefit over a well-tuned CPU implementation (e.g.
>>>> >>> BIDMat+MKL or netlib-java+openblas-compiled).
>>>> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
>>>> >>> can be 1-2 orders of magnitude worse than a well-tuned CPU
>>>> >>> implementation, particularly for larger matrices. This is not to
>>>> >>> pick on netlib - this basically agrees with the author's own
>>>> >>> benchmarks (https://github.com/fommil/netlib-java).
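>>>> >>>
>>>> >>> To keep comparisons like this fair on the JVM, the timing loop needs
>>>> >>> a warm-up pass. A minimal Breeze-based sketch (sizes and iteration
>>>> >>> counts are illustrative):
>>>> >>>
>>>> >>>   import breeze.linalg.DenseMatrix
>>>> >>>
>>>> >>>   def timeGemm(n: Int, iters: Int): Double = {
>>>> >>>     val a = DenseMatrix.rand(n, n)
>>>> >>>     val b = DenseMatrix.rand(n, n)
>>>> >>>     var c = a * b  // warm-up: triggers JIT and native library loading
>>>> >>>     val start = System.nanoTime()
>>>> >>>     for (_ <- 1 to iters) { c = a * b }
>>>> >>>     (System.nanoTime() - start) / 1e9 / iters  // seconds per multiply
>>>> >>>   }
>>>> >>>
>>>> >>>   Seq(100, 1000, 5000).foreach(n => println(s"$n: ${timeGemm(n, 3)} s"))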
>>>> >>>
>>>> >>> I think that most of our users are in a situation where using GPUs
>>>> >>> may not be practical - although we could consider having a good GPU
>>>> >>> backend available as an option. However, *ALL* users of MLlib could
>>>> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>>> >>> BLAS implementation. Perhaps we should consider updating the mllib
>>>> >>> guide with a more complete section for enabling high performance
>>>> >>> binaries on OSX and Linux? Or better, figure out a way for the
>>>> >>> system to fetch these automatically.
>>>> >>>
>>>> >>> - Evan
>>>> >>>
>>>> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>> >>>
>>>> >>>> Just to summarize this thread, I was finally able to make all the
>>>> >>>> performance comparisons that we discussed. It turns out that:
>>>> >>>>
>>>> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>> >>>>
>>>> >>>> Below is the link to the spreadsheet with the full results.
>>>> >>>>
>>>> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>> >>>>
>>>> >>>> One thing still needs exploration: does BIDMat-cublas perform
>>>> >>>> copying to/from the machine's RAM?
>>>> >>>>
>>>> >>>> -----Original Message-----
>>>> >>>> From: Ulanov, Alexander
>>>> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> >>>> To: Evan R. Sparks
>>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
>>>> >>>> the original one discusses a slightly different topic. I was able to
>>>> >>>> link netlib with the MKL from the BIDMat binaries. Indeed, MKL is
>>>> >>>> statically linked inside a 60MB library.
>>>> >>>>
>>>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>>> >>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>>> >>>> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>>>> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>>>> >>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>>> >>>>
>>>> >>>> It turns out that the pre-compiled MKL is faster than the
>>>> >>>> precompiled OpenBlas on my machine. Probably, I'll add two more
>>>> >>>> columns with locally compiled openblas and cuda.
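>>>> >>>>
>>>> >>>> For comparing rows like these across machines, it can help to
>>>> >>>> convert the times to GFLOPS (a sketch; it assumes the times above
>>>> >>>> are in seconds and that an n x n GEMM costs about 2*n^3 floating
>>>> >>>> point operations):
>>>> >>>>
>>>> >>>>   def gflops(n: Long, seconds: Double): Double =
>>>> >>>>     2.0 * n * n * n / seconds / 1e9
>>>> >>>>
>>>> >>>>   println(gflops(10000L, 23.78046632))  // BIDMat MKL: ~84 GFLOPS
>>>> >>>>   println(gflops(10000L, 1569.233228))  // f2jblas: ~1.3 GFLOPS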
>>>> >>>>
>>>> >>>> Alexander
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> >>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>>>> >>>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>>> >>>>
>>>> >>>> It seems like this is going to be somewhat exploratory for a while
>>>> >>>> (and there's probably only a handful of us who really care about
>>>> >>>> fast linear algebra!)
>>>> >>>>
>>>> >>>> - Evan
>>>> >>>>
>>>> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>> >>>>
>>>> >>>> Hi Evan,
>>>> >>>>
>>>> >>>> Thank you for the explanation and the useful link. I am going to
>>>> >>>> build OpenBLAS, link it with Netlib-java and run the benchmark again.
>>>> >>>>
>>>> >>>> Do I understand correctly that the BIDMat binaries contain a
>>>> >>>> statically linked Intel MKL BLAS? That might be the reason why I am
>>>> >>>> able to run BIDMat without having MKL BLAS installed on my server.
>>>> >>>> If so, I wonder whether that is OK, because Intel sells this
>>>> >>>> library. Nevertheless, it seems that in my case the precompiled MKL
>>>> >>>> BLAS performs better than the precompiled OpenBLAS, given that
>>>> >>>> BIDMat and Netlib-java are supposed to be on par in their JNI
>>>> >>>> overheads.
>>>> >>>>
>>>> >>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>>>> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
>>>> >>>> Halliday (Netlib-java) would be interested in comparing their
>>>> >>>> libraries.
>>>> >>>>
>>>> >>>> Best regards, Alexander
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> >>>> Sent: Friday, February 06, 2015 5:58 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>>> >>>> from getting cache sizes, etc. set up correctly for your particular
>>>> >>>> hardware - this is often a very tricky process (see, e.g., ATLAS),
>>>> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>>> >>>> quickly and yields performance competitive with MKL.
>>>> >>>>
>>>> >>>> To make sure the right library is getting used, you have to make
>>>> >>>> sure it's first on the search path - export
>>>> >>>> LD_LIBRARY_PATH=/path/to/the/directory/containing/the/blas/library
>>>> >>>> will do the trick here.
>>>> >>>>
>>>> >>>> For some examples of getting netlib-java set up on an EC2 node and
>>>> >>>> some example benchmarking code we ran a while back, see:
>>>> >>>> https://github.com/shivaram/matrix-bench
>>>> >>>>
>>>> >>>> In particular, build-openblas-ec2.sh shows you how to build the
>>>> >>>> library and set up symlinks correctly, and scala/run-netlib.sh shows
>>>> >>>> you how to get the path set up and get that library picked up by
>>>> >>>> netlib-java.
>>>> >>>>
>>>> >>>> In this way, you could probably get cuBLAS set up to be used by
>>>> >>>> netlib-java as well.
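>>>> >>>>
>>>> >>>> Once the paths are set, a quick way to confirm netlib-java picked up
>>>> >>>> the native library rather than falling back to Java (a sketch; the
>>>> >>>> printed class is NativeSystemBLAS on success, F2jBLAS on fallback):
>>>> >>>>
>>>> >>>>   // Reports the BLAS backend netlib-java selected at load time.
>>>> >>>>   println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)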
>>>> >>>>
>>>> >>>> - Evan
>>>> >>>>
>>>> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>> >>>>
>>>> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>>>> >>>> load the right blas? For netlib, there are a few JVM flags, such as
>>>> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>>>> >>>> so I can force it to use the Java implementation. I am not sure I
>>>> >>>> understand how to force the use of a specific blas (not a specific
>>>> >>>> wrapper for blas).
>>>> >>>>
>>>> >>>> Btw, I have installed openblas (yum install openblas), so I suppose
>>>> >>>> that netlib is using it.
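>>>> >>>>
>>>> >>>> (The flag selects the netlib-java wrapper implementation. A sketch
>>>> >>>> of doing the same from code - it must run before the first BLAS
>>>> >>>> use, because the implementation is chosen once at class-load time;
>>>> >>>> this value forces the pure-Java fallback:)
>>>> >>>>
>>>> >>>>   System.setProperty("com.github.fommil.netlib.BLAS",
>>>> >>>>     "com.github.fommil.netlib.F2jBLAS")
>>>> >>>>   println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)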
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> >>>> Sent: Friday, February 06, 2015 5:19 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Getting breeze to pick up the right blas library is critical for
>>>> >>>> performance. I recommend using OpenBLAS (or MKL, if you already
>>>> >>>> have it). It might make sense to force BIDMat to use the same
>>>> >>>> underlying BLAS library as well.
>>>> >>>>
>>>> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>> >>>>
>>>> >>>> Hi Evan, Joseph
>>>> >>>>
>>>> >>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>>>> >>>> faster than netlib-java+breeze (sorry for the weird table formatting):
>>>> >>>>
>>>> >>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>>> >>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
>>>> >>>> |100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>>>> >>>> |1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>>>> >>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>>>> >>>>
>>>> >>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>>>> >>>> 19 Linux, Scala 2.11.
>>>> >>>>
>>>> >>>> Later I will run tests with CUDA. I need to install a new CUDA
>>>> >>>> version for this purpose.
>>>> >>>>
>>>> >>>> Do you have any ideas why breeze-netlib with native blas is so much
>>>> >>>> slower than BIDMat MKL?
>>>> >>>>
>>>> >>>> Best regards, Alexander
>>>> >>>>
>>>> >>>> From: Joseph Bradley [mailto:jos...@databricks.com]
>>>> >>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: Evan R. Sparks; dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> Hi Alexander,
>>>> >>>>
>>>> >>>> Using GPUs with Spark would be very exciting. Small comment:
>>>> >>>> concerning your question earlier about keeping data stored on the
>>>> >>>> GPU rather than having to move it between main memory and GPU
>>>> >>>> memory on each iteration, I would guess this would be critical to
>>>> >>>> getting good performance. If you could do multiple local iterations
>>>> >>>> before aggregating results, then the cost of data movement to the
>>>> >>>> GPU could be amortized (and I believe that is done in practice).
>>>> >>>> Having Spark be aware of the GPU and using it as another part of
>>>> >>>> memory sounds like a much bigger undertaking.
>>>> >>>>
>>>> >>>> Joseph
>>>> >>>>
>>>> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>> >>>>
>>>> >>>> Thank you for the explanation! I've watched the BIDMach
>>>> >>>> presentation by John Canny and I am really inspired by his talk and
>>>> >>>> the comparisons with Spark MLlib.
>>>> >>>>
>>>> >>>> I am very interested to find out what will be better within Spark:
>>>> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
>>>> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
>>>> >>>> neural networks in batch mode. While it is not a "pure" test of
>>>> >>>> linear algebra, it involves some other things that are essential to
>>>> >>>> machine learning.
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> >>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>>> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>>>> >>>> data layout and fewer levels of indirection - it's definitely a
>>>> >>>> worthwhile experiment to run. The main speedups I've seen from
>>>> >>>> using it come from highly optimized GPU code for linear algebra. I
>>>> >>>> know that in the past Canny has gone as far as to write custom GPU
>>>> >>>> kernels for performance-critical regions of code. [1]
>>>> >>>>
>>>> >>>> BIDMach is highly optimized for single-node performance or
>>>> >>>> performance on small clusters. [2] Once data doesn't fit easily in
>>>> >>>> GPU memory (or can't be batched in that way) the performance tends
>>>> >>>> to fall off. Canny argues for hardware/software codesign and as
>>>> >>>> such prefers machine configurations that are quite different from
>>>> >>>> what we find in most commodity cluster nodes - e.g. 10 disk
>>>> >>>> channels and 4 GPUs.
>>>> >>>>
>>>> >>>> In contrast, MLlib was designed for horizontal scalability on
>>>> >>>> commodity clusters and works best on very big datasets - on the
>>>> >>>> order of terabytes.
>>>> >>>>
>>>> >>>> For the most part, these projects developed concurrently to address
>>>> >>>> slightly different use cases. That said, there may be bits of
>>>> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>>>> >>>> careful about maintaining cross-language compatibility for our Java
>>>> >>>> and Python users, though.
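>>>> >>>>
>>>> >>>> A back-of-envelope sketch of the amortization point (the throughput
>>>> >>>> and bandwidth figures below are assumptions, not measurements): GEMM
>>>> >>>> compute grows as n^3 while PCIe transfer grows only as n^2, so the
>>>> >>>> copies dominate for small matrices.
>>>> >>>>
>>>> >>>>   val gpuGflops = 1000.0  // assumed sustained dgemm rate, GFLOP/s
>>>> >>>>   val pcieGBs   = 6.0     // assumed effective PCIe bandwidth, GB/s
>>>> >>>>   def computeSec(n: Long)  = 2.0 * n * n * n / (gpuGflops * 1e9)
>>>> >>>>   def transferSec(n: Long) = 3.0 * 8 * n * n / (pcieGBs * 1e9)  // A, B, C as doubles
>>>> >>>>   for (n <- Seq(100L, 1000L, 10000L))
>>>> >>>>     println(s"n=$n compute=${computeSec(n)}s transfer=${transferSec(n)}s")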
>>>> >>>>
>>>> >>>> - Evan
>>>> >>>>
>>>> >>>> [1] - http://arxiv.org/abs/1409.5402
>>>> >>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>> >>>>
>>>> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>> >>>>
>>>> >>>> Hi Evan,
>>>> >>>>
>>>> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed.
>>>> >>>> Do you know what makes it faster than netlib-java?
>>>> >>>>
>>>> >>>> The same group has the BIDMach library that implements machine
>>>> >>>> learning. For some examples they use the Caffe convolutional neural
>>>> >>>> network library owned by another group in Berkeley. Could you
>>>> >>>> elaborate on how these all might be connected with Spark MLlib? If
>>>> >>>> you take BIDMat for linear algebra, why don't you take BIDMach for
>>>> >>>> optimization and learning?
>>>> >>>>
>>>> >>>> Best regards, Alexander
>>>> >>>>
>>>> >>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> >>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>> >>>> To: Ulanov, Alexander
>>>> >>>> Cc: dev@spark.apache.org
>>>> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>> >>>>
>>>> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>>>> >>>> blas in many cases.
>>>> >>>>
>>>> >>>> You might consider taking a look at the codepaths that BIDMat
>>>> >>>> (https://github.com/BIDData/BIDMat) takes and comparing them to
>>>> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>>>> >>>> optimizing to make this work really fast from Scala. I've run it on
>>>> >>>> my laptop, compared it to MKL, and in certain cases it's 10x faster
>>>> >>>> at matrix multiply. There are a lot of layers of indirection here
>>>> >>>> and you really want to avoid data copying as much as possible.
>>>> >>>>
>>>> >>>> We could also consider swapping Breeze out for BIDMat, but that
>>>> >>>> would be a big project, and if we can figure out how to get
>>>> >>>> breeze+cublas to comparable performance, that would be a big win.
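>>>> >>>>
>>>> >>>> To see what the stack costs, one can also call the loaded BLAS
>>>> >>>> directly through netlib-java, bypassing the Breeze layers (a
>>>> >>>> sketch; column-major arrays in, C = alpha*A*B + beta*C out):
>>>> >>>>
>>>> >>>>   import com.github.fommil.netlib.BLAS
>>>> >>>>
>>>> >>>>   val n = 512
>>>> >>>>   val a = Array.fill(n * n)(math.random)  // column-major n x n
>>>> >>>>   val b = Array.fill(n * n)(math.random)
>>>> >>>>   val c = new Array[Double](n * n)
>>>> >>>>   BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)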
>>>> >>>>
>>>> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>> >>>>
>>>> >>>> Dear Spark developers,
>>>> >>>>
>>>> >>>> I am exploring how to make linear algebra operations faster within
>>>> >>>> Spark. One way of doing this is to use the Scala Breeze library
>>>> >>>> that is bundled with Spark. For matrix operations, it employs
>>>> >>>> netlib-java, which has a Java wrapper for the BLAS (basic linear
>>>> >>>> algebra subprograms) and LAPACK native binaries if they are
>>>> >>>> available on the worker node. It also has its own optimized Java
>>>> >>>> implementation of BLAS. It is worth mentioning that the native
>>>> >>>> binaries provide better performance only for BLAS level 3, i.e.
>>>> >>>> matrix-matrix operations or general matrix multiplication (GEMM).
>>>> >>>> This is confirmed by the GEMM test on the netlib-java page
>>>> >>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>>>> >>>> experiments with training of an artificial neural network
>>>> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>> >>>> However, I would like to boost performance even more.
>>>> >>>>
>>>> >>>> GPUs are supposed to be fast at linear algebra, and there is an
>>>> >>>> Nvidia CUDA implementation of BLAS called cublas. I have one Linux
>>>> >>>> server with an Nvidia GPU and I was able to do the following. I
>>>> >>>> linked cublas (instead of a CPU-based blas) with the netlib-java
>>>> >>>> wrapper and put it into Spark, so Breeze/Netlib is using it. Then I
>>>> >>>> did some performance measurements with regards to artificial neural
>>>> >>>> network batch learning in Spark MLlib, which involves matrix-matrix
>>>> >>>> multiplications. It turns out that for matrices of size less than
>>>> >>>> ~1000x780, GPU cublas has the same speed as CPU blas, and cublas
>>>> >>>> becomes slower for bigger matrices. It is worth mentioning that
>>>> >>>> this was not a test of ONLY multiplication, since other operations
>>>> >>>> are involved. One of the reasons for the slowdown might be the
>>>> >>>> overhead of copying the matrices from main memory to graphics card
>>>> >>>> memory and back.
>>>> >>>>
>>>> >>>> So, a few questions:
>>>> >>>> 1) Do these results with CUDA make sense?
>>>> >>>> 2) If the problem is the copy overhead, are there any libraries
>>>> >>>> that allow forcing intermediate results to stay in graphics card
>>>> >>>> memory, thus removing the overhead?
>>>> >>>> 3) Any other options to speed up linear algebra in Spark?
>>>> >>>>
>>>> >>>> Thank you, Alexander
>>>>
>>>> --
>>>> Best regards,
>>>> Sam