Maybe you can ask Prof. John Canny himself :-) I invited him to give a talk 
at Alpine Data Labs at the March meetup (SF Big Analytics & SF Machine Learning 
joint meetup) on 3/11. To be announced in the next day or so.

Chester

Sent from my iPhone

> On Feb 9, 2015, at 4:48 PM, "Ulanov, Alexander" <alexander.ula...@hp.com> 
> wrote:
> 
> Hi Evan,
> 
> Thank you for the explanation and the useful link. I am going to build 
> OpenBLAS, link it with Netlib-java, and run the benchmark again.
> 
> Do I understand correctly that the BIDMat binaries contain statically linked 
> Intel MKL BLAS? That might be why I am able to run BIDMat without having MKL 
> BLAS installed on my server. If so, I wonder whether that is OK, since Intel 
> sells this library. Nevertheless, it seems that in my case precompiled MKL 
> BLAS performs better than precompiled OpenBLAS, given that BIDMat and 
> Netlib-java are supposed to be on par in JNI overhead.
> 
> Still, it might be interesting to link Netlib-java with Intel MKL, as you 
> suggested. I wonder whether John Canny (BIDMat) and Sam Halliday (Netlib-java) 
> would be interested in comparing their libraries.
> 
> Best regards, Alexander
> 
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Friday, February 06, 2015 5:58 PM
> To: Ulanov, Alexander
> Cc: Joseph Bradley; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
> 
> I would build OpenBLAS yourself, since good BLAS performance comes from 
> getting cache sizes, etc., set up correctly for your particular hardware - 
> this is often a very tricky process (see, e.g., ATLAS) - but we found that on 
> relatively modern Xeon chips, OpenBLAS builds quickly and yields performance 
> competitive with MKL.
> 
> To make sure the right library is getting used, you have to make sure it's 
> first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library/dir 
> will do the trick here (note that LD_LIBRARY_PATH takes the directory 
> containing the .so, not the .so itself).
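> 
> As a quick sanity check, here is a minimal Scala sketch (assuming netlib-java 
> is on the classpath) that prints which BLAS implementation actually got 
> loaded:
> 
>   import com.github.fommil.netlib.BLAS
> 
>   object BlasCheck {
>     def main(args: Array[String]): Unit = {
>       // Prints e.g. com.github.fommil.netlib.NativeSystemBLAS,
>       // NativeRefBLAS, or F2jBLAS, depending on what was picked up.
>       println(BLAS.getInstance().getClass.getName)
>     }
>   }
> 
> You can also pin a specific wrapper with a JVM flag, e.g. 
> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS.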
> 
> For some examples of getting netlib-java set up on an EC2 node, and some 
> example benchmarking code we ran a while back, see: 
> https://github.com/shivaram/matrix-bench
> 
> In particular, build-openblas-ec2.sh shows you how to build the library and 
> set up the symlinks correctly, and scala/run-netlib.sh shows you how to set 
> up the path and get that library picked up by netlib-java.
> 
> In this way, you could probably get cuBLAS set up to be used by netlib-java 
> as well.
> 
> - Evan
> 
> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> 
> wrote:
> Evan, could you elaborate on how to force BIDMat and netlib-java to load the 
> right BLAS? For netlib, there are a few JVM flags, such as 
> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can 
> force it to use the Java implementation. I am not sure I understand how to 
> force the use of a specific BLAS (as opposed to a specific wrapper for BLAS).
> 
> Btw, I have installed OpenBLAS (yum install openblas), so I suppose that 
> netlib is using it.
> 
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Friday, February 06, 2015 5:19 PM
> To: Ulanov, Alexander
> Cc: Joseph Bradley; dev@spark.apache.org
> 
> Subject: Re: Using CUDA within Spark / boosting linear algebra
> 
> Getting Breeze to pick up the right BLAS library is critical for performance. 
> I recommend using OpenBLAS (or MKL, if you already have it). It might make 
> sense to force BIDMat to use the same underlying BLAS library as well.
> 
> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> 
> wrote:
> Hi Evan, Joseph
> 
> I did a few matrix multiplication tests, and BIDMat seems to be ~10x faster 
> than netlib-java+breeze:
> 
> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> +------------------------+-------------+-----------------------------------------------+----------------------------+
> |100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
> |1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
> |10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
> 
> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, 
> Scala 2.11.
> 
> Later I will run tests with CUDA; I need to install a new CUDA version for 
> this purpose.
> 
> Do you have any ideas why Breeze+Netlib-java with native BLAS is so much 
> slower than BIDMat MKL?
> 
> Best regards, Alexander
> 
> From: Joseph Bradley [mailto:jos...@databricks.com]
> Sent: Thursday, February 05, 2015 5:29 PM
> To: Ulanov, Alexander
> Cc: Evan R. Sparks; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
> 
> Hi Alexander,
> 
> Using GPUs with Spark would be very exciting.  Small comment: concerning your 
> earlier question about keeping data stored on the GPU rather than having to 
> move it between main memory and GPU memory on each iteration, I would guess 
> this would be critical to getting good performance.  If you could do multiple 
> local iterations before aggregating results, then the cost of data movement 
> to the GPU could be amortized (and I believe that is done in practice).  
> Having Spark be aware of the GPU and use it as another part of memory sounds 
> like a much bigger undertaking.
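> 
> A rough Scala sketch of that amortization pattern (copyToGpu, gpuStep, and 
> copyFromGpu below are hypothetical placeholders for whatever JNI binding ends 
> up being used; only the structure is the point):
> 
>   // Pay the host-to-device transfer once per partition, run several local
>   // steps on the GPU, then do a single device-to-host copy before
>   // aggregating across partitions.
>   val locallyTrained = data.mapPartitions { rows =>
>     val deviceData = copyToGpu(rows.toArray)   // hypothetical helper
>     var weights = copyToGpu(initialWeights)    // hypothetical helper
>     for (_ <- 1 to numLocalIterations) {
>       weights = gpuStep(deviceData, weights)   // stays in GPU memory
>     }
>     Iterator(copyFromGpu(weights))             // hypothetical helper
>   }
>   val aggregated = locallyTrained.reduce(combineWeights)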
> 
> Joseph
> 
> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> 
> wrote:
> Thank you for the explanation! I’ve watched the BIDMach presentation by John 
> Canny and I am really inspired by his talk and the comparisons with Spark 
> MLlib.
> 
> I am very interested in finding out which will work better within Spark: 
> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a fair way 
> to benchmark them? Currently I run benchmarks on artificial neural networks 
> in batch mode. While that is not a “pure” test of linear algebra, it involves 
> some other things that are essential to machine learning.
> 
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, February 05, 2015 1:29 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
> 
> I'd be surprised if BIDMat+OpenBLAS was significantly faster than 
> netlib-java+OpenBLAS, but if it is much faster, that's probably due to data 
> layout and fewer levels of indirection - it's definitely a worthwhile 
> experiment to run. The main speedups I've seen from using it come from its 
> highly optimized GPU code for linear algebra. I know that in the past Canny 
> has gone as far as to write custom GPU kernels for performance-critical 
> regions of code.[1]
> 
> BIDMach is highly optimized for single-node performance or performance on 
> small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be 
> batched to fit), the performance tends to fall off. Canny argues for 
> hardware/software codesign and as such prefers machine configurations that 
> are quite different from what we find in most commodity cluster nodes - e.g., 
> 10 disk channels and 4 GPUs.
> 
> In contrast, MLlib was designed for horizontal scalability on commodity 
> clusters and works best on very big datasets - on the order of terabytes.
> 
> For the most part, these projects developed concurrently to address slightly 
> different use cases. That said, there may be bits of BIDMach we could 
> repurpose for MLlib - keep in mind we need to be careful about maintaining 
> cross-language compatibility for our Java and Python users, though.
> 
> - Evan
> 
> [1] - http://arxiv.org/abs/1409.5402
> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> 
> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> 
> wrote:
> Hi Evan,
> 
> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you 
> know what makes it faster than netlib-java?
> 
> The same group has the BIDMach library that implements machine learning. For 
> some examples they use the Caffe convolutional neural network library 
> maintained by another group at Berkeley. Could you elaborate on how all of 
> these might be connected with Spark MLlib? If you take BIDMat for linear 
> algebra, why not take BIDMach for optimization and learning?
> 
> Best regards, Alexander
> 
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, February 05, 2015 12:09 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
> 
> I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many 
> cases.
> 
> You might consider taking a look at the codepaths that BIDMat 
> (https://github.com/BIDData/BIDMat) takes and comparing them to 
> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing 
> to make this work really fast from Scala. I've run it on my laptop and 
> compared it to MKL, and in certain cases it's 10x faster at matrix multiply. 
> There are a lot of layers of indirection here, and you really want to avoid 
> data copying as much as possible.
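> 
> If you want a quick number on the Breeze side, a minimal timing sketch like 
> the one below (one warm-up multiply so the JIT and the native library are 
> loaded; not a rigorous benchmark) exercises the same dgemm path:
> 
>   import breeze.linalg._
> 
>   object GemmBench {
>     def main(args: Array[String]): Unit = {
>       val n = 1000
>       val a = DenseMatrix.rand(n, n)
>       val b = DenseMatrix.rand(n, n)
>       a * b                                   // warm-up run
>       val start = System.nanoTime()
>       val c = a * b                           // dispatches to BLAS dgemm
>       val secs = (System.nanoTime() - start) / 1e9
>       println(s"${n}x$n multiply: $secs s, c(0,0) = ${c(0, 0)}")
>     }
>   }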
> 
> We could also consider swapping Breeze out for BIDMat, but that would be a 
> big project, and if we can figure out how to get breeze+cublas to comparable 
> performance, that would be a big win.
> 
> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> 
> wrote:
> Dear Spark developers,
> 
> I am exploring how to make linear algebra operations faster within Spark. One 
> way of doing this is to use the Scala Breeze library that is bundled with 
> Spark. For matrix operations, it employs Netlib-java, which has a Java 
> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native 
> binaries if they are available on the worker node, and it also has its own 
> optimized Java implementation of BLAS. It is worth mentioning that native 
> binaries provide better performance only for BLAS level 3, i.e. matrix-matrix 
> operations or general matrix multiplication (GEMM). This is confirmed by the 
> GEMM test on the Netlib-java page https://github.com/fommil/netlib-java. I 
> also confirmed it in my experiments with training an artificial neural 
> network https://github.com/apache/spark/pull/1290#issuecomment-70313952. 
> However, I would like to boost performance further.
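> 
> For concreteness, a minimal sketch of going through the Netlib-java wrapper 
> directly for a level-3 GEMM (column-major arrays, computing 
> C := alpha*A*B + beta*C):
> 
>   import com.github.fommil.netlib.BLAS
> 
>   object DirectGemm {
>     def main(args: Array[String]): Unit = {
>       val n = 2
>       val a = Array(1.0, 2.0, 3.0, 4.0)   // 2x2 matrix, column-major
>       val b = Array(1.0, 0.0, 0.0, 1.0)   // 2x2 identity
>       val c = new Array[Double](n * n)
>       // "N", "N" = no transpose; alpha = 1.0, beta = 0.0
>       BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
>       println(c.mkString(", "))           // prints 1.0, 2.0, 3.0, 4.0
>     }
>   }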
> 
> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA 
> implementation of BLAS called cuBLAS. I have one Linux server with an Nvidia 
> GPU, and I was able to do the following. I linked cuBLAS (instead of a 
> CPU-based BLAS) with the Netlib-java wrapper and put it into Spark, so 
> Breeze/Netlib is using it. Then I did some performance measurements of 
> artificial neural network batch learning in Spark MLlib, which involves 
> matrix-matrix multiplications. It turns out that for matrices of size less 
> than ~1000x780, GPU cuBLAS has the same speed as CPU BLAS, and cuBLAS becomes 
> slower for bigger matrices. It is worth mentioning that this was not a test 
> of ONLY multiplication, since there are other operations involved. One of the 
> reasons for the slowdown might be the overhead of copying the matrices from 
> main memory to graphics card memory and back.
> 
> So, a few questions:
> 1) Do these results with CUDA make sense?
> 2) If the problem is the copy overhead, are there any libraries that allow 
> intermediate results to stay in graphics card memory, thus removing the 
> overhead?
> 3) Any other options to speed up linear algebra in Spark?
> 
> Thank you, Alexander
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
