Re: Using CUDA within Spark / boosting linear algebra

Joseph Bradley Wed, 25 Feb 2015 15:39:52 -0800

Better documentation for linking would be very helpful!  Here's a JIRA:
https://issues.apache.org/jira/browse/SPARK-6019



On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com>
wrote:

> Thanks for compiling all the data and running these benchmarks, Alex. The
> big takeaways here can be seen with this chart:
>
> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>
> 1) A properly configured GPU matrix multiply implementation (e.g.
> BIDMat+GPU) can provide substantial (but less than an order of magnitude)
> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
> netlib-java+openblas-compiled).
> 2) A poorly tuned CPU implementation (e.g. netlib-f2jblas or netlib-ref)
> can be 1-2 orders of magnitude worse than a well-tuned CPU implementation,
> particularly for larger matrices. This is not to pick on netlib - this
> basically agrees with the author's own benchmarks (
> https://github.com/fommil/netlib-java)
>
> I think that most of our users are in a situation where using GPUs may not
> be practical - although we could consider having a good GPU backend
> available as an option. However, *ALL* users of MLlib could benefit
> (potentially tremendously) from using a well-tuned CPU-based BLAS
> implementation. Perhaps we should consider updating the MLlib guide with a
> more complete section for enabling high performance binaries on OSX and
> Linux? Or better, figure out a way for the system to fetch these
> automatically.
>
> - Evan
>
>
>
> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
> alexander.ula...@hp.com> wrote:
>
>> Just to summarize this thread, I was finally able to make all performance
>> comparisons that we discussed. It turns out that:
>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>
>> Below is the link to the spreadsheet with full results.
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> One thing still needs exploration: does BIDMat-cublas copy data to/from
>> the machine’s RAM?
>>
>> -----Original Message-----
>> From: Ulanov, Alexander
>> Sent: Tuesday, February 10, 2015 2:12 PM
>> To: Evan R. Sparks
>> Cc: Joseph Bradley; dev@spark.apache.org
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Thanks, Evan! It seems that the ticket was marked as a duplicate, though
>> the original one discusses a slightly different topic. I was able to link
>> netlib with MKL from the BIDMat binaries. Indeed, MKL is statically linked
>> inside a 60MB library.
>>
>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>> +-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>> | 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>> | 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>> | 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>
>> It turns out that pre-compiled MKL is faster than pre-compiled OpenBLAS on
>> my machine. I’ll probably add two more columns with locally compiled
>> OpenBLAS and CUDA.
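
To make the table easier to compare, here is a rough sketch that recomputes each column's time relative to BIDMat MKL. The numbers are copied from the table in this message (decimal commas as written). The dictionary keys and the helper names `parse_decimal_comma` and `relative_to` are made up for illustration, and the thread does not state the time unit.

```python
# Timings copied verbatim from the table in this message (decimal commas kept).
# Each list holds the 100x100, 1000x1000 and 10000x10000 results, in that order.
RAW = {
    "BIDMat MKL": ["0,00205596", "0,018320947", "23,78046632"],
    "Breeze+Netlib-MKL": ["0,000381", "0,038316857", "32,94546697"],
    "Breeze+Netlib-OpenBlas": ["0,03810324", "0,51803557", "445,0935211"],
    "Breeze+Netlib-f2jblas": ["0,002556", "1,638475459", "1569,233228"],
}
SIZES = [100, 1000, 10000]


def parse_decimal_comma(text):
    """Convert a number written with a decimal comma (e.g. '0,5') to float."""
    return float(text.replace(",", "."))


def relative_to(baseline, raw=RAW):
    """Return {column: [time / baseline time, one entry per matrix size]}."""
    base = [parse_decimal_comma(t) for t in raw[baseline]]
    return {
        name: [parse_decimal_comma(t) / b for t, b in zip(times, base)]
        for name, times in raw.items()
    }


if __name__ == "__main__":
    ratios = relative_to("BIDMat MKL")
    for name, values in ratios.items():
        cells = ", ".join(f"n={n}: {v:.2f}x" for n, v in zip(SIZES, values))
        print(f"{name}: {cells}")
```

At the largest size this works out to roughly 1.4x for Netlib-MKL, about 19x for the system OpenBLAS and about 66x for f2jblas, all relative to BIDMat MKL.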
>>
>> Alexander
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Monday, February 09, 2015 6:06 PM
>> To: Ulanov, Alexander
>> Cc: Joseph Bradley; dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> Great - perhaps we can move this discussion off-list and onto a JIRA
>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>
>> It seems like this is going to be somewhat exploratory for a while (and
>> there's probably only a handful of us who really care about fast linear
>> algebra!)
>>
>> - Evan
>>
>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>> alexander.ula...@hp.com> wrote:
>> Hi Evan,
>>
>> Thank you for the explanation and the useful link. I am going to build
>> OpenBLAS, link it with Netlib-java and run the benchmark again.
>>
>> Do I understand correctly that the BIDMat binaries contain statically linked
>> Intel MKL BLAS? That might be why I am able to run BIDMat without having
>> MKL BLAS installed on my server. If so, I wonder whether that is OK,
>> because Intel sells this library. Nevertheless, it seems that in my case
>> precompiled MKL BLAS performs better than precompiled OpenBLAS, given that
>> BIDMat and Netlib-java are supposed to be on par in JNI overhead.
>>
>> Still, it might be interesting to link Netlib-java with Intel MKL, as
>> you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
>> (Netlib-java) would be interested in comparing their libraries.
>>
>> Best regards, Alexander
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Friday, February 06, 2015 5:58 PM
>> To: Ulanov, Alexander
>> Cc: Joseph Bradley; dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> I would build OpenBLAS yourself, since good BLAS performance comes from
>> getting cache sizes, etc. set up correctly for your particular hardware -
>> this is often a very tricky process (see, e.g. ATLAS), but we found that on
>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>> performance competitive with MKL.
>>
>> To make sure the right library is getting used, you have to make sure
>> it's first on the search path - export
>> LD_LIBRARY_PATH=/path/to/blas/library/dir (the directory that contains the
>> .so file) will do the trick here.
>>
>> For some examples of getting netlib-java set up on an ec2 node and some
>> example benchmarking code we ran a while back, see:
>> https://github.com/shivaram/matrix-bench
>>
>> In particular - build-openblas-ec2.sh shows you how to build the library
>> and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
>> the path set up and get that library picked up by netlib-java.
>>
>> In this way - you could probably get cuBLAS set up to be used by
>> netlib-java as well.
>>
>> - Evan
>>
>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>> alexander.ula...@hp.com> wrote:
>> Evan, could you elaborate on how to force BIDMat and netlib-java to load
>> the right BLAS? For netlib, there are a few JVM flags, such as
>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
>> force it to use the Java implementation. I am not sure I understand how to
>> force it to use a specific BLAS (as opposed to a specific wrapper for BLAS).
>>
>> Btw. I have installed openblas (yum install openblas), so I suppose that
>> netlib is using it.
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Friday, February 06, 2015 5:19 PM
>> To: Ulanov, Alexander
>> Cc: Joseph Bradley; dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> Getting breeze to pick up the right blas library is critical for
>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>> It might make sense to force BIDMat to use the same underlying BLAS library
>> as well.
>>
>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>> alexander.ula...@hp.com> wrote:
>> Hi Evan, Joseph
>>
>> I ran a few matrix multiplication tests, and BIDMat seems to be ~10x faster
>> than netlib-java+breeze (sorry for the weird table formatting):
>>
>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
>> | 100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>> | 1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>> | 10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>>
>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
>> Linux, Scala 2.11.
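
As a sanity check on these timings, a square n x n product costs about 2*n^3 floating-point operations, so a time can be turned into a throughput figure. This is only a sketch: the thread does not state the time unit, so seconds are assumed here, and the function name `gemm_gflops` is made up for illustration.

```python
def gemm_gflops(n, elapsed):
    """Throughput of an (n x n) * (n x n) product using the 2*n**3 flop count.

    `elapsed` is ASSUMED to be in seconds; the thread does not state the unit.
    """
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    return 2 * n ** 3 / elapsed / 1e9


if __name__ == "__main__":
    # BIDMat MKL column of the table above, decimal commas written as points.
    for n, t in [(100, 0.00205596), (1000, 0.018320947), (10000, 23.78046632)]:
        print(f"n={n}: {gemm_gflops(n, t):.1f} GFLOP/s")
```

Under that assumption the 10000x10000 case comes out at about 84 GFLOP/s for BIDMat MKL.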
>>
>> Later I will run tests with CUDA. I need to install a new CUDA version for
>> this purpose.
>>
>> Do you have any ideas why breeze-netlib with native blas is so much
>> slower than BIDMat MKL?
>>
>> Best regards, Alexander
>>
>> From: Joseph Bradley [mailto:jos...@databricks.com]
>> Sent: Thursday, February 05, 2015 5:29 PM
>> To: Ulanov, Alexander
>> Cc: Evan R. Sparks; dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> Hi Alexander,
>>
>> Using GPUs with Spark would be very exciting.  Small comment: Concerning
>> your question earlier about keeping data stored on the GPU rather than
>> having to move it between main memory and GPU memory on each iteration, I
>> would guess this would be critical to getting good performance.  If you
>> could do multiple local iterations before aggregating results, then the
>> cost of data movement to the GPU could be amortized (and I believe that is
>> done in practice).  Having Spark be aware of the GPU and using it as
>> another part of memory sounds like a much bigger undertaking.
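
The amortization argument can be made concrete with a toy cost model. Everything in it is hypothetical: the function name, the parameters and the example numbers are illustrative and were not measured in this thread.

```python
def time_per_iteration(transfer_s, compute_s, local_iterations):
    """Average wall time per iteration when one host<->GPU round trip
    (transfer_s) is paid once per `local_iterations` GPU iterations
    (compute_s each)."""
    if local_iterations < 1:
        raise ValueError("local_iterations must be at least 1")
    return compute_s + transfer_s / local_iterations


if __name__ == "__main__":
    # Hypothetical costs: 0.5 s to move the data, 0.125 s of GPU compute.
    for k in (1, 2, 10, 100):
        print(k, time_per_iteration(0.5, 0.125, k))
```

As `local_iterations` grows, the per-iteration time approaches `compute_s`, which is the sense in which the data movement cost is amortized.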
>>
>> Joseph
>>
>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>> alexander.ula...@hp.com> wrote:
>> Thank you for the explanation! I’ve watched the BIDMach presentation by John
>> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>>
>> I am very interested in finding out which will be better within Spark: BIDMat
>> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
>> benchmark them? Currently I run benchmarks on artificial neural networks in
>> batch mode. While this is not a “pure” test of linear algebra, it involves
>> some other things that are essential to machine learning.
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Thursday, February 05, 2015 1:29 PM
>> To: Ulanov, Alexander
>> Cc: dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
>> layout and fewer levels of indirection - it's definitely a worthwhile
>> experiment to run. The main speedups I've seen from using it come from
>> highly optimized GPU code for linear algebra. I know that in the past Canny
>> has gone as far as to write custom GPU kernels for performance-critical
>> regions of code.[1]
>>
>> BIDMach is highly optimized for single node performance or performance on
>> small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be
>> batched in that way) the performance tends to fall off. Canny argues for
>> hardware/software codesign and as such prefers machine configurations that
>> are quite different than what we find in most commodity cluster nodes -
>> e.g. 10 disk channels and 4 GPUs.
>>
>> In contrast, MLlib was designed for horizontal scalability on commodity
>> clusters and works best on very big datasets - order of terabytes.
>>
>> For the most part, these projects developed concurrently to address
>> slightly different use cases. That said, there may be bits of BIDMach we
>> could repurpose for MLlib - keep in mind we need to be careful about
>> maintaining cross-language compatibility for our Java and Python users,
>> though.
>>
>> - Evan
>>
>> [1] - http://arxiv.org/abs/1409.5402
>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>
>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>> alexander.ula...@hp.com> wrote:
>> Hi Evan,
>>
>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you
>> know what makes it faster than netlib-java?
>>
>> The same group has the BIDMach library that implements machine learning. For
>> some examples they use the Caffe convolutional neural network library owned
>> by another group in Berkeley. Could you elaborate on how these all might be
>> connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t
>> you take BIDMach for optimization and learning?
>>
>> Best regards, Alexander
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Thursday, February 05, 2015 12:09 PM
>> To: Ulanov, Alexander
>> Cc: dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
>> many cases.
>>
>> You might consider taking a look at the codepaths that BIDMat (
>> https://github.com/BIDData/BIDMat) takes and comparing them to
>> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
>> to make this work really fast from Scala. I've run it on my laptop and
>> compared to MKL and in certain cases it's 10x faster at matrix multiply.
>> There are a lot of layers of indirection here and you really want to avoid
>> data copying as much as possible.
>>
>> We could also consider swapping out Breeze for BIDMat, but that would be
>> a big project and if we can figure out how to get breeze+cublas to
>> comparable performance that would be a big win.
>>
>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>> alexander.ula...@hp.com> wrote:
>> Dear Spark developers,
>>
>> I am exploring how to make linear algebra operations faster within Spark.
>> One way of doing this is to use the Scala Breeze library that is bundled
>> with Spark. For matrix operations, it employs Netlib-java, which provides a
>> Java wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
>> binaries if they are available on the worker node. It also has its own
>> optimized Java implementation of BLAS. It is worth mentioning that native
>> binaries provide better performance only for BLAS level 3, i.e.
>> matrix-matrix operations or general matrix multiplication (GEMM). This is
>> confirmed by the GEMM test on the Netlib-java page
>> https://github.com/fommil/netlib-java. I also confirmed it with my
>> experiments with training an artificial neural network
>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>> However, I would like to boost performance more.
>>
>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA
>> implementation of BLAS called cublas. I have one Linux server with an Nvidia
>> GPU and I was able to do the following. I linked cublas (instead of
>> CPU-based BLAS) with the Netlib-java wrapper and put it into Spark, so
>> Breeze/Netlib is using it. Then I did some performance measurements of
>> artificial neural network batch learning in Spark MLlib, which
>> involves matrix-matrix multiplications. It turns out that for matrices of
>> size less than ~1000x780, GPU cublas has the same speed as CPU BLAS. Cublas
>> becomes slower for bigger matrices. It is worth mentioning that this was not
>> a test of ONLY multiplication, since there are other operations involved.
>> One of the reasons for the slowdown might be the overhead of copying the
>> matrices from computer memory to graphics card memory and back.
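
One rough way to reason about the copy overhead is to compare the bytes that cross the bus with the floating-point work of a single GEMM. The sketch below assumes double precision, that both inputs are copied to the card and the result is copied back, and the usual 2*m*k*n operation count. The function names are made up for illustration and nothing here was measured.

```python
BYTES_PER_DOUBLE = 8


def gemm_flops(m, k, n):
    """Flops for an (m x k) * (k x n) product: m*k*n multiply-adds, 2 flops each."""
    return 2 * m * k * n


def gemm_bytes_moved(m, k, n):
    """Bytes copied if A and B go to the GPU and C comes back (all doubles)."""
    return BYTES_PER_DOUBLE * (m * k + k * n + m * n)


def flops_per_byte(m, k, n):
    """Work done per byte transferred; higher means the copies matter less."""
    return gemm_flops(m, k, n) / gemm_bytes_moved(m, k, n)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(flops_per_byte(n, n, n), 1))
```

For square matrices this ratio is n/12, so under this model the copy cost of a single isolated GEMM should matter less as n grows. A slowdown at larger sizes may therefore point at repeated transfers or at the other operations in the workload rather than at one copy per multiplication.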
>>
>> So, a few questions:
>> 1) Do these results with CUDA make sense?
>> 2) If the problem is with copy overhead, are there any libraries that
>> allow forcing intermediate results to stay in graphics card memory, thus
>> removing the overhead?
>> 3) Any other options to speed up linear algebra in Spark?
>>
>> Thank you, Alexander
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>>
>>
>
