Better documentation for linking would be very helpful! Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-6019
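
As a starting point for such documentation, here is a minimal sketch of the two settings discussed in the quoted thread below: the loader search path and netlib-java's BLAS system property. The install directory /opt/OpenBLAS/lib and the BLAS_DIR and NETLIB_OPTS variable names are assumptions for illustration only. The NativeSystemBLAS class name is the native counterpart of the F2jBLAS class mentioned in the thread; please check it against the netlib-java README for your version.

```shell
# Hypothetical location of a locally built OpenBLAS; adjust to your install.
BLAS_DIR="${BLAS_DIR:-/opt/OpenBLAS/lib}"

# The dynamic loader searches directories, so list the directory that
# contains the BLAS shared library and put it first on the search path.
export LD_LIBRARY_PATH="$BLAS_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# netlib-java picks its BLAS wrapper from this JVM system property.
# NativeSystemBLAS asks for the system BLAS; F2jBLAS is the pure-Java fallback.
NETLIB_OPTS="-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS"

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo "JVM option: $NETLIB_OPTS"

# Example use (your-app.jar is a placeholder):
#   spark-submit --driver-java-options "$NETLIB_OPTS" your-app.jar
```

The sketch only prepares the environment and prints it; whether the native library is actually picked up still has to be verified on each node, since netlib-java falls back to the Java implementation when loading fails.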
On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
> Thanks for compiling all the data and running these benchmarks, Alex. The
> big takeaways here can be seen with this chart:
>
> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>
> 1) A properly configured GPU matrix multiply implementation (e.g.
> BIDMat+GPU) can provide substantial (but less than an order of magnitude)
> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
> netlib-java+openblas-compiled).
> 2) A poorly tuned CPU implementation (e.g. netlib-f2jblas or netlib-ref)
> can be 1-2 orders of magnitude worse than a well-tuned CPU implementation,
> particularly for larger matrices. This is not to pick on netlib - this
> basically agrees with the author's own benchmarks
> (https://github.com/fommil/netlib-java).
>
> I think that most of our users are in a situation where using GPUs may not
> be practical - although we could consider having a good GPU backend
> available as an option. However, *ALL* users of MLlib could benefit
> (potentially tremendously) from using a well-tuned CPU-based BLAS
> implementation. Perhaps we should consider updating the mllib guide with a
> more complete section for enabling high performance binaries on OSX and
> Linux? Or better, figure out a way for the system to fetch these
> automatically.
>
> - Evan
>
> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>
>> Just to summarize this thread, I was finally able to make all performance
>> comparisons that we discussed. It turns out that:
>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>
>> Below is the link to the spreadsheet with full results.
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> One thing still needs exploration: does BIDMat-cublas perform copying
>> to/from the machine’s RAM?
>>
>> -----Original Message-----
>> From: Ulanov, Alexander
>> Sent: Tuesday, February 10, 2015 2:12 PM
>> To: Evan R. Sparks
>> Cc: Joseph Bradley; dev@spark.apache.org
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Thanks, Evan! It seems that ticket was marked as duplicate though the
>> original one discusses a slightly different topic. I was able to link netlib
>> with MKL from BIDMat binaries. Indeed, MKL is statically linked inside a
>> 60MB library.
>>
>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>> +-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>> | 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>> | 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>> | 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>
>> It turns out that pre-compiled MKL is faster than precompiled OpenBlas on
>> my machine. Probably, I’ll add two more columns with locally compiled
>> openblas and cuda.
>>
>> Alexander
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Monday, February 09, 2015 6:06 PM
>> To: Ulanov, Alexander
>> Cc: Joseph Bradley; dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> Great - perhaps we can move this discussion off-list and onto a JIRA
>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>
>> It seems like this is going to be somewhat exploratory for a while (and
>> there's probably only a handful of us who really care about fast linear
>> algebra!)
>>
>> - Evan
>>
>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>> Hi Evan,
>>
>> Thank you for the explanation and the useful link. I am going to build
>> OpenBLAS, link it with Netlib-java and run the benchmark again.
>>
>> Do I understand correctly that BIDMat binaries contain statically linked
>> Intel MKL BLAS? It might be the reason why I am able to run BIDMat without
>> having MKL BLAS installed on my server. If it is true, I wonder if it is OK
>> because Intel sells this library. Nevertheless, it seems that in my case
>> precompiled MKL BLAS performs better than precompiled OpenBLAS given that
>> BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>>
>> Though, it might be interesting to link Netlib-java with Intel MKL, as
>> you suggested. I wonder, are John Canny (BIDMat) and Sam Halliday
>> (Netlib-java) interested in comparing their libraries?
>>
>> Best regards, Alexander
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Friday, February 06, 2015 5:58 PM
>> To: Ulanov, Alexander
>> Cc: Joseph Bradley; dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> I would build OpenBLAS yourself, since good BLAS performance comes from
>> getting cache sizes, etc. set up correctly for your particular hardware -
>> this is often a very tricky process (see, e.g. ATLAS), but we found that on
>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>> performance competitive with MKL.
>>
>> To make sure the right library is getting used, you have to make sure
>> it's first on the search path - export
>> LD_LIBRARY_PATH=/path/to/blas/libdir (the directory that holds the .so)
>> will do the trick here.
>>
>> For some examples of getting netlib-java set up on an ec2 node and some
>> example benchmarking code we ran a while back, see:
>> https://github.com/shivaram/matrix-bench
>>
>> In particular - build-openblas-ec2.sh shows you how to build the library
>> and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
>> the path set up and get that library picked up by netlib-java.
>>
>> In this way - you could probably get cuBLAS set up to be used by
>> netlib-java as well.
>>
>> - Evan
>>
>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>> Evan, could you elaborate on how to force BIDMat and netlib-java to load
>> the right blas? For netlib, there are a few JVM flags, such as
>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
>> force it to use the Java implementation. Not sure I understand how to force
>> the use of a specific blas (as opposed to a specific wrapper for blas).
>>
>> Btw. I have installed openblas (yum install openblas), so I suppose that
>> netlib is using it.
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Friday, February 06, 2015 5:19 PM
>> To: Ulanov, Alexander
>> Cc: Joseph Bradley; dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> Getting breeze to pick up the right blas library is critical for
>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>> It might make sense to force BIDMat to use the same underlying BLAS library
>> as well.
>>
>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>> Hi Evan, Joseph
>>
>> I ran a few matrix multiplication tests and BIDMat seems to be ~10x faster
>> than netlib-java+breeze (sorry for weird table formatting):
>>
>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
>> | 100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>> | 1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>> | 10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>>
>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
>> Linux, Scala 2.11.
>>
>> Later I will make tests with Cuda. I need to install a new Cuda version for
>> this purpose.
>>
>> Do you have any ideas why breeze-netlib with native blas is so much
>> slower than BIDMat MKL?
>>
>> Best regards, Alexander
>>
>> From: Joseph Bradley [mailto:jos...@databricks.com]
>> Sent: Thursday, February 05, 2015 5:29 PM
>> To: Ulanov, Alexander
>> Cc: Evan R. Sparks; dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> Hi Alexander,
>>
>> Using GPUs with Spark would be very exciting. Small comment: Concerning
>> your question earlier about keeping data stored on the GPU rather than
>> having to move it between main memory and GPU memory on each iteration, I
>> would guess this would be critical to getting good performance. If you
>> could do multiple local iterations before aggregating results, then the
>> cost of data movement to the GPU could be amortized (and I believe that is
>> done in practice). Having Spark be aware of the GPU and using it as
>> another part of memory sounds like a much bigger undertaking.
>>
>> Joseph
>>
>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>> Thank you for the explanation! I’ve watched the BIDMach presentation by John
>> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>>
>> I am very interested to find out what will be better within Spark: BIDMat
>> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
>> benchmark them? Currently I do benchmarks on artificial neural networks in
>> batch mode. While it is not a “pure” test of linear algebra, it involves
>> some other things that are essential to machine learning.
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Thursday, February 05, 2015 1:29 PM
>> To: Ulanov, Alexander
>> Cc: dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
>> layout and fewer levels of indirection - it's definitely a worthwhile
>> experiment to run. The main speedups I've seen from using it come from
>> highly optimized GPU code for linear algebra. I know that in the past Canny
>> has gone as far as to write custom GPU kernels for performance-critical
>> regions of code.[1]
>>
>> BIDMach is highly optimized for single node performance or performance on
>> small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be
>> batched in that way) the performance tends to fall off. Canny argues for
>> hardware/software codesign and as such prefers machine configurations that
>> are quite different than what we find in most commodity cluster nodes -
>> e.g. 10 disk channels and 4 GPUs.
>>
>> In contrast, MLlib was designed for horizontal scalability on commodity
>> clusters and works best on very big datasets - on the order of terabytes.
>>
>> For the most part, these projects developed concurrently to address
>> slightly different use cases. That said, there may be bits of BIDMach we
>> could repurpose for MLlib - keep in mind we need to be careful about
>> maintaining cross-language compatibility for our Java and Python users,
>> though.
>>
>> - Evan
>>
>> [1] - http://arxiv.org/abs/1409.5402
>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>
>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>> Hi Evan,
>>
>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you
>> know what makes them faster than netlib-java?
>>
>> The same group has the BIDMach library that implements machine learning. For
>> some examples they use the Caffe convolutional neural network library owned by
>> another group in Berkeley. Could you elaborate on how these all might be
>> connected with Spark MLlib? If you take BIDMat for linear algebra why don’t
>> you take BIDMach for optimization and learning?
>>
>> Best regards, Alexander
>>
>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> Sent: Thursday, February 05, 2015 12:09 PM
>> To: Ulanov, Alexander
>> Cc: dev@spark.apache.org
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
>> many cases.
>>
>> You might consider taking a look at the codepaths that BIDMat
>> (https://github.com/BIDData/BIDMat) takes and comparing them to
>> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
>> to make this work really fast from Scala.
>> I've run it on my laptop and compared to MKL, and in certain cases it's
>> 10x faster at matrix multiply. There are a lot of layers of indirection
>> here and you really want to avoid data copying as much as possible.
>>
>> We could also consider swapping out Breeze for BIDMat, but that would be
>> a big project and if we can figure out how to get breeze+cublas to
>> comparable performance that would be a big win.
>>
>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>> Dear Spark developers,
>>
>> I am exploring how to make linear algebra operations faster within Spark.
>> One way of doing this is to use the Scala Breeze library that is bundled with
>> Spark. For matrix operations, it employs Netlib-java that has a Java
>> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
>> binaries if they are available on the worker node. It also has its own
>> optimized Java implementation of BLAS. It is worth mentioning that native
>> binaries provide better performance only for BLAS level 3, i.e.
>> matrix-matrix operations or general matrix multiplication (GEMM). This is
>> confirmed by the GEMM test on the Netlib-java page
>> https://github.com/fommil/netlib-java. I also confirmed it with my
>> experiments with training of an artificial neural network
>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>> However, I would like to boost performance more.
>>
>> GPU is supposed to work fast with linear algebra and there is an Nvidia CUDA
>> implementation of BLAS, called cublas. I have one Linux server with an Nvidia
>> GPU and I was able to do the following. I linked cublas (instead of
>> cpu-based blas) with the Netlib-java wrapper and put it into Spark, so
>> Breeze/Netlib is using it. Then I did some performance measurements with
>> regard to artificial neural network batch learning in Spark MLlib that
>> involves matrix-matrix multiplications. It turns out that for matrices of
>> size less than ~1000x780 GPU cublas has the same speed as CPU blas. Cublas
>> becomes slower for bigger matrices. It is worth mentioning that it was not
>> a test for ONLY multiplication since there are other operations involved.
>> One of the reasons for the slowdown might be the overhead of copying the
>> matrices from computer memory to graphics card memory and back.
>>
>> So, a few questions:
>> 1) Do these results with CUDA make sense?
>> 2) If the problem is with copy overhead, are there any libraries that
>> allow forcing intermediate results to stay in graphics card memory, thus
>> removing the overhead?
>> 3) Any other options to speed up linear algebra in Spark?
>>
>> Thank you, Alexander
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
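
For anyone comparing the timings quoted in this thread against hardware peak numbers, the following is a rough sketch that converts the 10000x10000 row of the benchmark table into throughput. It rests on two assumptions that the thread does not state: that the table values are wall-clock seconds, and that a dense n x n by n x n multiply costs about 2 * n^3 floating-point operations. The function and variable names are made up for this example.

```python
# GEMM throughput from the timings quoted in this thread.
# Assumptions (not stated in the thread): the table values are wall-clock
# seconds, and an n x n by n x n multiply costs about 2 * n**3
# floating-point operations (n**3 multiplies plus n**3 adds).


def gemm_gflops(n: int, seconds: float) -> float:
    """Return GFLOP/s for one n x n matrix product that took `seconds`."""
    return 2.0 * n ** 3 / seconds / 1e9


# 10000x10000 row of the table, with decimal commas rewritten as points.
timings_10000 = {
    "BIDMat MKL": 23.78046632,
    "Breeze+Netlib-MKL from BIDMat": 32.94546697,
    "Breeze+Netlib-OpenBlas (native system)": 445.0935211,
    "Breeze+Netlib-f2jblas": 1569.233228,
}

gflops_10000 = {
    name: gemm_gflops(10000, seconds) for name, seconds in timings_10000.items()
}

for name, rate in gflops_10000.items():
    print(f"{name:40s} {rate:8.2f} GFLOP/s")
```

Expressing the results as a rate makes rows of different matrix sizes comparable with each other, which raw timings are not.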