On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday <sam.halli...@gmail.com> wrote: > Also, check the JNILoader output. > > Remember, for netlib-java to use your system libblas all you need to do is > set up libblas.so.3 like any native application would expect. > > I haven't ever used the cublas "real BLAS" implementation, so I'd be > interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to check > that all the runtime links are in order. >
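On checking whether the GPU is actually used: besides watching utilization, you can print which implementation netlib-java bound to and push one small call through it. A minimal sketch, assuming netlib-java is on the classpath (the object name and the 2x2 matrices are only for illustration):

  import com.github.fommil.netlib.BLAS

  object BlasCheck {
    def main(args: Array[String]): Unit = {
      // JNILoader logs which native library it picked up; this prints the
      // implementation class that was actually instantiated, e.g.
      // NativeSystemBLAS (system libblas.so.3), NativeRefBLAS or F2jBLAS.
      println(BLAS.getInstance().getClass.getName)

      // One tiny column-major dgemm (C := 1.0*A*B + 0.0*C) so that a
      // Level 3 routine goes through the loaded library at least once.
      val a = Array(1.0, 2.0, 3.0, 4.0)
      val b = Array(5.0, 6.0, 7.0, 8.0)
      val c = new Array[Double](4)
      BLAS.getInstance().dgemm("N", "N", 2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2)
      println(c.mkString(", "))
    }
  }

If libblas.so.3 is the NVBLAS-fronted one, a Level 3 call like this should register as GPU activity; with plain OpenBLAS or the reference BLAS it stays on the CPU.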
There are two shared libraries in this hybrid setup. nvblas.so must be loaded before libblas.so so that it can intercept Level 3 routines and run them on the GPU. More details are at: http://docs.nvidia.com/cuda/nvblas/index.html#Usage > Btw, I have some DGEMM wrappers in my netlib-java performance module... and > I also planned to write more in MultiBLAS (until I mothballed the project > for the hardware to catch up, which it probably has and now I just need a > reason to look at it) > > On 27 Feb 2015 20:26, "Xiangrui Meng" <men...@gmail.com> wrote: >> >> Hey Sam, >> >> The running times are not "big O" estimates: >> >> > The CPU version finished in 12 seconds. >> > The CPU->GPU->CPU version finished in 2.2 seconds. >> > The GPU version finished in 1.7 seconds. >> >> I think there is something wrong with the netlib/cublas combination. >> Sam already mentioned that cuBLAS doesn't implement the CPU BLAS >> interfaces. I checked the CUDA doc and it seems that to use GPU BLAS >> through the CPU BLAS interface we need to use NVBLAS, which intercepts >> some Level 3 CPU BLAS calls (including GEMM). So we need to load >> nvblas.so first and then some CPU BLAS library in JNI. I wonder >> whether the setup was correct. >> >> Alexander, could you check whether GPU is used in the netlib-cublas >> experiments? You can tell it by watching CPU/GPU usage. >> >> Best, >> Xiangrui >> >> On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday <sam.halli...@gmail.com> >> wrote: >> > Don't use "big O" estimates, always measure. It used to work back in the >> > days when double multiplication was a bottleneck. The computation cost >> > is >> > effectively free on both the CPU and GPU and you're seeing pure copying >> > costs. Also, I'm dubious that cublas is doing what you think it is. Can >> > you >> > link me to the source code for DGEMM? >> > >> > I show all of this in my talk, with explanations, I can't stress enough >> > how >> > much I recommend that you watch it if you want to understand high >> > performance hardware acceleration for linear algebra :-) >> > >> > On 27 Feb 2015 01:42, "Xiangrui Meng" <men...@gmail.com> wrote: >> >> >> >> The copying overhead should be quadratic on n, while the computation >> >> cost is cubic on n. I can understand that netlib-cublas is slower than >> >> netlib-openblas on small problems. But I'm surprised to see that it is >> >> still 20x slower on 10000x10000. I did the following on a g2.2xlarge >> >> instance with BIDMat: >> >> >> >> val n = 10000 >> >> >> >> val f = rand(n, n) >> >> flip; f*f; val rf = flop >> >> >> >> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = >> >> flop >> >> >> >> flip; g*g; val rgg = flop >> >> >> >> The CPU version finished in 12 seconds. >> >> The CPU->GPU->CPU version finished in 2.2 seconds. >> >> The GPU version finished in 1.7 seconds. >> >> >> >> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas >> >> path. But based on the result, the data copying overhead is definitely >> >> not as big as 20x at n = 10000. >> >> >> >> Best, >> >> Xiangrui >> >> >> >> >> >> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday <sam.halli...@gmail.com> >> >> wrote: >> >> > I've had some email exchanges with the author of BIDMat: it does >> >> > exactly >> >> > what you need to get the GPU benefit and writes higher level >> >> > algorithms >> >> > entirely in the GPU kernels so that the memory stays there as long as >> >> > possible.
The restriction with this approach is that it is only >> >> > offering >> >> > high-level algorithms so is not a toolkit for applied mathematics >> >> > research and development --- but it works well as a toolkit for >> >> > higher >> >> > level analysis (e.g. for analysts and practitioners). >> >> > >> >> > I believe BIDMat's approach is the best way to get performance out of >> >> > GPU hardware at the moment but I also have strong evidence to suggest >> >> > that the hardware will catch up and the memory transfer costs between >> >> > CPU/GPU will disappear meaning that there will be no need for custom >> >> > GPU >> >> > kernel implementations. i.e. please continue to use BLAS primitives >> >> > when >> >> > writing new algorithms and only go to the GPU for an alternative >> >> > optimised implementation. >> >> > >> >> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and >> >> > offer >> >> > an API that looks like BLAS but takes pointers to special regions in >> >> > the >> >> > GPU memory region. Somebody has written a wrapper around CUDA to >> >> > create >> >> > a proper BLAS library but it only gives marginal performance over the >> >> > CPU because of the memory transfer overhead. >> >> > >> >> > This slide from my talk >> >> > >> >> > http://fommil.github.io/scalax14/#/11/2 >> >> > >> >> > says it all. X axis is matrix size, Y axis is logarithmic time to do >> >> > DGEMM. Black line is the "cheating" time for the GPU and the green >> >> > line >> >> > is after copying the memory to/from the GPU memory. APUs have the >> >> > potential to eliminate the green line. >> >> > >> >> > Best regards, >> >> > Sam >> >> > >> >> > >> >> > >> >> > "Ulanov, Alexander" <alexander.ula...@hp.com> writes: >> >> > >> >> >> Evan, thank you for the summary. I would like to add some more >> >> >> observations. The GPU that I used is 2.5 times cheaper than the CPU >> >> >> ($250 vs >> >> >> $100). They both are 3 years old. I've also did a small test with >> >> >> modern >> >> >> hardware, and the new GPU nVidia Titan was slightly more than 1 >> >> >> order of >> >> >> magnitude faster than Intel E5-2650 v2 for the same tests. However, >> >> >> it costs >> >> >> as much as CPU ($1200). My takeaway is that GPU is making a better >> >> >> price/value progress. >> >> >> >> >> >> >> >> >> >> >> >> Xiangrui, I was also surprised that BIDMat-cuda was faster than >> >> >> netlib-cuda and the most reasonable explanation is that it holds the >> >> >> result >> >> >> in GPU memory, as Sam suggested. At the same time, it is OK because >> >> >> you can >> >> >> copy the result back from GPU only when needed. However, to be sure, >> >> >> I am >> >> >> going to ask the developer of BIDMat on his upcoming talk. >> >> >> >> >> >> >> >> >> >> >> >> Best regards, Alexander >> >> >> >> >> >> >> >> >> From: Sam Halliday [mailto:sam.halli...@gmail.com] >> >> >> Sent: Thursday, February 26, 2015 1:56 PM >> >> >> To: Xiangrui Meng >> >> >> Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. >> >> >> Sparks >> >> >> Subject: Re: Using CUDA within Spark / boosting linear algebra >> >> >> >> >> >> >> >> >> Btw, I wish people would stop cheating when comparing CPU and GPU >> >> >> timings for things like matrix multiply :-P >> >> >> >> >> >> Please always compare apples with apples and include the time it >> >> >> takes >> >> >> to set up the matrices, send it to the processing unit, doing the >> >> >> calculation AND copying it back to where you need to see the >> >> >> results. 
>> >> >> >> >> >> Ignoring this method will make you believe that your GPU is >> >> >> thousands >> >> >> of times faster than it really is. Again, jump to the end of my talk >> >> >> for >> >> >> graphs and more discussion.... especially the bit about me being >> >> >> keen on >> >> >> funding to investigate APU hardware further ;-) (I believe it will >> >> >> solve the >> >> >> problem) >> >> >> On 26 Feb 2015 21:16, "Xiangrui Meng" >> >> >> <men...@gmail.com<mailto:men...@gmail.com>> wrote: >> >> >> Hey Alexander, >> >> >> >> >> >> I don't quite understand the part where netlib-cublas is about 20x >> >> >> slower than netlib-openblas. What is the overhead of using a GPU >> >> >> BLAS >> >> >> with netlib-java? >> >> >> >> >> >> CC'ed Sam, the author of netlib-java. >> >> >> >> >> >> Best, >> >> >> Xiangrui >> >> >> >> >> >> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley >> >> >> <jos...@databricks.com<mailto:jos...@databricks.com>> wrote: >> >> >>> Better documentation for linking would be very helpful! Here's a >> >> >>> JIRA: >> >> >>> https://issues.apache.org/jira/browse/SPARK-6019 >> >> >>> >> >> >>> >> >> >>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks >> >> >>> <evan.spa...@gmail.com<mailto:evan.spa...@gmail.com>> >> >> >>> wrote: >> >> >>> >> >> >>>> Thanks for compiling all the data and running these benchmarks, >> >> >>>> Alex. >> >> >>>> The >> >> >>>> big takeaways here can be seen with this chart: >> >> >>>> >> >> >>>> >> >> >>>> >> >> >>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive >> >> >>>> >> >> >>>> 1) A properly configured GPU matrix multiply implementation (e.g. >> >> >>>> BIDMat+GPU) can provide substantial (but less than an order of >> >> >>>> magnitude) >> >> >>>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or >> >> >>>> netlib-java+openblas-compiled). >> >> >>>> 2) A poorly tuned CPU implementation can be 1-2 orders of >> >> >>>> magnitude >> >> >>>> worse >> >> >>>> than a well-tuned CPU implementation, particularly for larger >> >> >>>> matrices. >> >> >>>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - >> >> >>>> this >> >> >>>> basically agrees with the authors own benchmarks ( >> >> >>>> https://github.com/fommil/netlib-java) >> >> >>>> >> >> >>>> I think that most of our users are in a situation where using GPUs >> >> >>>> may not >> >> >>>> be practical - although we could consider having a good GPU >> >> >>>> backend >> >> >>>> available as an option. However, *ALL* users of MLlib could >> >> >>>> benefit >> >> >>>> (potentially tremendously) from using a well-tuned CPU-based BLAS >> >> >>>> implementation. Perhaps we should consider updating the mllib >> >> >>>> guide >> >> >>>> with a >> >> >>>> more complete section for enabling high performance binaries on >> >> >>>> OSX >> >> >>>> and >> >> >>>> Linux? Or better, figure out a way for the system to fetch these >> >> >>>> automatically. >> >> >>>> >> >> >>>> - Evan >> >> >>>> >> >> >>>> >> >> >>>> >> >> >>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander < >> >> >>>> alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>> wrote: >> >> >>>> >> >> >>>>> Just to summarize this thread, I was finally able to make all >> >> >>>>> performance >> >> >>>>> comparisons that we discussed. 
It turns out that: >> >> >>>>> BIDMat-cublas>>BIDMat >> >> >>>>> >> >> >>>>> >> >> >>>>> MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo==netlib-cublas>netlib-blas>f2jblas >> >> >>>>> >> >> >>>>> Below is the link to the spreadsheet with full results. >> >> >>>>> >> >> >>>>> >> >> >>>>> >> >> >>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing >> >> >>>>> >> >> >>>>> One thing still needs exploration: does BIDMat-cublas perform >> >> >>>>> copying >> >> >>>>> to/from machine’s RAM? >> >> >>>>> >> >> >>>>> -----Original Message----- >> >> >>>>> From: Ulanov, Alexander >> >> >>>>> Sent: Tuesday, February 10, 2015 2:12 PM >> >> >>>>> To: Evan R. Sparks >> >> >>>>> Cc: Joseph Bradley; >> >> >>>>> dev@spark.apache.org<mailto:dev@spark.apache.org> >> >> >>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra >> >> >>>>> >> >> >>>>> Thanks, Evan! It seems that ticket was marked as duplicate though >> >> >>>>> the >> >> >>>>> original one discusses slightly different topic. I was able to >> >> >>>>> link >> >> >>>>> netlib >> >> >>>>> with MKL from BIDMat binaries. Indeed, MKL is statically linked >> >> >>>>> inside a >> >> >>>>> 60MB library. >> >> >>>>> >> >> >>>>> |A*B size | BIDMat MKL | Breeze+Netlib-MKL from BIDMat| >> >> >>>>> Breeze+Netlib-OpenBlas(native system)| Breeze+Netlib-f2jblas | >> >> >>>>> >> >> >>>>> >> >> >>>>> +-----------------------------------------------------------------------+ >> >> >>>>> |100x100*100x100 | 0,00205596 | 0,000381 | 0,03810324 | 0,002556 >> >> >>>>> | >> >> >>>>> |1000x1000*1000x1000 | 0,018320947 | 0,038316857 | 0,51803557 >> >> >>>>> |1,638475459 | >> >> >>>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697 |445,0935211 >> >> >>>>> | >> >> >>>>> 1569,233228 | >> >> >>>>> >> >> >>>>> It turn out that pre-compiled MKL is faster than precompiled >> >> >>>>> OpenBlas on >> >> >>>>> my machine. Probably, I’ll add two more columns with locally >> >> >>>>> compiled >> >> >>>>> openblas and cuda. >> >> >>>>> >> >> >>>>> Alexander >> >> >>>>> >> >> >>>>> From: Evan R. Sparks >> >> >>>>> [mailto:evan.spa...@gmail.com<mailto:evan.spa...@gmail.com>] >> >> >>>>> Sent: Monday, February 09, 2015 6:06 PM >> >> >>>>> To: Ulanov, Alexander >> >> >>>>> Cc: Joseph Bradley; >> >> >>>>> dev@spark.apache.org<mailto:dev@spark.apache.org> >> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra >> >> >>>>> >> >> >>>>> Great - perhaps we can move this discussion off-list and onto a >> >> >>>>> JIRA >> >> >>>>> ticket? (Here's one: >> >> >>>>> https://issues.apache.org/jira/browse/SPARK-5705) >> >> >>>>> >> >> >>>>> It seems like this is going to be somewhat exploratory for a >> >> >>>>> while >> >> >>>>> (and >> >> >>>>> there's probably only a handful of us who really care about fast >> >> >>>>> linear >> >> >>>>> algebra!) >> >> >>>>> >> >> >>>>> - Evan >> >> >>>>> >> >> >>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander < >> >> >>>>> >> >> >>>>> >> >> >>>>> alexander.ula...@hp.com<mailto:alexander.ula...@hp.com><mailto:alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>>> >> >> >>>>> wrote: >> >> >>>>> Hi Evan, >> >> >>>>> >> >> >>>>> Thank you for explanation and useful link. I am going to build >> >> >>>>> OpenBLAS, >> >> >>>>> link it with Netlib-java and perform benchmark again. >> >> >>>>> >> >> >>>>> Do I understand correctly that BIDMat binaries contain statically >> >> >>>>> linked >> >> >>>>> Intel MKL BLAS? 
It might be the reason why I am able to run >> >> >>>>> BIDMat >> >> >>>>> not >> >> >>>>> having MKL BLAS installed on my server. If it is true, I wonder >> >> >>>>> if >> >> >>>>> it is OK >> >> >>>>> because Intel sells this library. Nevertheless, it seems that in >> >> >>>>> my >> >> >>>>> case >> >> >>>>> precompiled MKL BLAS performs better than precompiled OpenBLAS >> >> >>>>> given >> >> >>>>> that >> >> >>>>> BIDMat and Netlib-java are supposed to be on par with JNI >> >> >>>>> overheads. >> >> >>>>> >> >> >>>>> Though, it might be interesting to link Netlib-java with Intel >> >> >>>>> MKL, >> >> >>>>> as >> >> >>>>> you suggested. I wonder, are John Canny (BIDMat) and Sam Halliday >> >> >>>>> (Netlib-java) interested to compare their libraries. >> >> >>>>> >> >> >>>>> Best regards, Alexander >> >> >>>>> >> >> >>>>> From: Evan R. Sparks >> >> >>>>> >> >> >>>>> [mailto:evan.spa...@gmail.com<mailto:evan.spa...@gmail.com><mailto: >> >> >>>>> evan.spa...@gmail.com<mailto:evan.spa...@gmail.com>>] >> >> >>>>> Sent: Friday, February 06, 2015 5:58 PM >> >> >>>>> >> >> >>>>> To: Ulanov, Alexander >> >> >>>>> Cc: Joseph Bradley; >> >> >>>>> >> >> >>>>> dev@spark.apache.org<mailto:dev@spark.apache.org><mailto:dev@spark.apache.org<mailto:dev@spark.apache.org>> >> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra >> >> >>>>> >> >> >>>>> I would build OpenBLAS yourself, since good BLAS performance >> >> >>>>> comes >> >> >>>>> from >> >> >>>>> getting cache sizes, etc. set up correctly for your particular >> >> >>>>> hardware - >> >> >>>>> this is often a very tricky process (see, e.g. ATLAS), but we >> >> >>>>> found >> >> >>>>> that on >> >> >>>>> relatively modern Xeon chips, OpenBLAS builds quickly and yields >> >> >>>>> performance competitive with MKL. >> >> >>>>> >> >> >>>>> To make sure the right library is getting used, you have to make >> >> >>>>> sure >> >> >>>>> it's first on the search path - export >> >> >>>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here. >> >> >>>>> >> >> >>>>> For some examples of getting netlib-java setup on an ec2 node and >> >> >>>>> some >> >> >>>>> example benchmarking code we ran a while back, see: >> >> >>>>> https://github.com/shivaram/matrix-bench >> >> >>>>> >> >> >>>>> In particular - build-openblas-ec2.sh shows you how to build the >> >> >>>>> library >> >> >>>>> and set up symlinks correctly, and scala/run-netlib.sh shows you >> >> >>>>> how >> >> >>>>> to get >> >> >>>>> the path setup and get that library picked up by netlib-java. >> >> >>>>> >> >> >>>>> In this way - you could probably get cuBLAS set up to be used by >> >> >>>>> netlib-java as well. >> >> >>>>> >> >> >>>>> - Evan >> >> >>>>> >> >> >>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander < >> >> >>>>> >> >> >>>>> >> >> >>>>> alexander.ula...@hp.com<mailto:alexander.ula...@hp.com><mailto:alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>>> >> >> >>>>> wrote: >> >> >>>>> Evan, could you elaborate on how to force BIDMat and netlib-java >> >> >>>>> to >> >> >>>>> force >> >> >>>>> loading the right blas? For netlib, I there are few JVM flags, >> >> >>>>> such >> >> >>>>> as >> >> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, >> >> >>>>> so >> >> >>>>> I can >> >> >>>>> force it to use Java implementation. Not sure I understand how to >> >> >>>>> force use >> >> >>>>> a specific blas (not specific wrapper for blas). >> >> >>>>> >> >> >>>>> Btw. 
I have installed openblas (yum install openblas), so I >> >> >>>>> suppose >> >> >>>>> that >> >> >>>>> netlib is using it. >> >> >>>>> >> >> >>>>> From: Evan R. Sparks >> >> >>>>> >> >> >>>>> [mailto:evan.spa...@gmail.com<mailto:evan.spa...@gmail.com><mailto: >> >> >>>>> evan.spa...@gmail.com<mailto:evan.spa...@gmail.com>>] >> >> >>>>> Sent: Friday, February 06, 2015 5:19 PM >> >> >>>>> To: Ulanov, Alexander >> >> >>>>> Cc: Joseph Bradley; >> >> >>>>> >> >> >>>>> dev@spark.apache.org<mailto:dev@spark.apache.org><mailto:dev@spark.apache.org<mailto:dev@spark.apache.org>> >> >> >>>>> >> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra >> >> >>>>> >> >> >>>>> Getting breeze to pick up the right blas library is critical for >> >> >>>>> performance. I recommend using OpenBLAS (or MKL, if you already >> >> >>>>> have >> >> >>>>> it). >> >> >>>>> It might make sense to force BIDMat to use the same underlying >> >> >>>>> BLAS >> >> >>>>> library >> >> >>>>> as well. >> >> >>>>> >> >> >>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander < >> >> >>>>> >> >> >>>>> >> >> >>>>> alexander.ula...@hp.com<mailto:alexander.ula...@hp.com><mailto:alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>>> >> >> >>>>> wrote: >> >> >>>>> Hi Evan, Joseph >> >> >>>>> >> >> >>>>> I did few matrix multiplication test and BIDMat seems to be ~10x >> >> >>>>> faster >> >> >>>>> than netlib-java+breeze (sorry for weird table formatting): >> >> >>>>> >> >> >>>>> |A*B size | BIDMat MKL | Breeze+Netlib-java >> >> >>>>> native_system_linux_x86-64| >> >> >>>>> Breeze+Netlib-java f2jblas | >> >> >>>>> >> >> >>>>> >> >> >>>>> +-----------------------------------------------------------------------+ >> >> >>>>> |100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 | >> >> >>>>> |1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 | >> >> >>>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211 | >> >> >>>>> 1569,233228 | >> >> >>>>> >> >> >>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, >> >> >>>>> Fedora >> >> >>>>> 19 >> >> >>>>> Linux, Scala 2.11. >> >> >>>>> >> >> >>>>> Later I will make tests with Cuda. I need to install new Cuda >> >> >>>>> version for >> >> >>>>> this purpose. >> >> >>>>> >> >> >>>>> Do you have any ideas why breeze-netlib with native blas is so >> >> >>>>> much >> >> >>>>> slower than BIDMat MKL? >> >> >>>>> >> >> >>>>> Best regards, Alexander >> >> >>>>> >> >> >>>>> From: Joseph Bradley >> >> >>>>> >> >> >>>>> [mailto:jos...@databricks.com<mailto:jos...@databricks.com><mailto: >> >> >>>>> jos...@databricks.com<mailto:jos...@databricks.com>>] >> >> >>>>> Sent: Thursday, February 05, 2015 5:29 PM >> >> >>>>> To: Ulanov, Alexander >> >> >>>>> Cc: Evan R. Sparks; >> >> >>>>> >> >> >>>>> dev@spark.apache.org<mailto:dev@spark.apache.org><mailto:dev@spark.apache.org<mailto:dev@spark.apache.org>> >> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra >> >> >>>>> >> >> >>>>> Hi Alexander, >> >> >>>>> >> >> >>>>> Using GPUs with Spark would be very exciting. Small comment: >> >> >>>>> Concerning >> >> >>>>> your question earlier about keeping data stored on the GPU rather >> >> >>>>> than >> >> >>>>> having to move it between main memory and GPU memory on each >> >> >>>>> iteration, I >> >> >>>>> would guess this would be critical to getting good performance. 
>> >> >>>>> If >> >> >>>>> you >> >> >>>>> could do multiple local iterations before aggregating results, >> >> >>>>> then >> >> >>>>> the >> >> >>>>> cost of data movement to the GPU could be amortized (and I >> >> >>>>> believe >> >> >>>>> that is >> >> >>>>> done in practice). Having Spark be aware of the GPU and using it >> >> >>>>> as >> >> >>>>> another part of memory sounds like a much bigger undertaking. >> >> >>>>> >> >> >>>>> Joseph >> >> >>>>> >> >> >>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander < >> >> >>>>> >> >> >>>>> >> >> >>>>> alexander.ula...@hp.com<mailto:alexander.ula...@hp.com><mailto:alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>>> >> >> >>>>> wrote: >> >> >>>>> Thank you for explanation! I’ve watched the BIDMach presentation >> >> >>>>> by >> >> >>>>> John >> >> >>>>> Canny and I am really inspired by his talk and comparisons with >> >> >>>>> Spark MLlib. >> >> >>>>> >> >> >>>>> I am very interested to find out what will be better within >> >> >>>>> Spark: >> >> >>>>> BIDMat >> >> >>>>> or netlib-java with CPU or GPU natives. Could you suggest a fair >> >> >>>>> way >> >> >>>>> to >> >> >>>>> benchmark them? Currently I do benchmarks on artificial neural >> >> >>>>> networks in >> >> >>>>> batch mode. While it is not a “pure” test of linear algebra, it >> >> >>>>> involves >> >> >>>>> some other things that are essential to machine learning. >> >> >>>>> >> >> >>>>> From: Evan R. Sparks >> >> >>>>> >> >> >>>>> [mailto:evan.spa...@gmail.com<mailto:evan.spa...@gmail.com><mailto: >> >> >>>>> evan.spa...@gmail.com<mailto:evan.spa...@gmail.com>>] >> >> >>>>> Sent: Thursday, February 05, 2015 1:29 PM >> >> >>>>> To: Ulanov, Alexander >> >> >>>>> Cc: >> >> >>>>> >> >> >>>>> dev@spark.apache.org<mailto:dev@spark.apache.org><mailto:dev@spark.apache.org<mailto:dev@spark.apache.org>> >> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra >> >> >>>>> >> >> >>>>> I'd be surprised of BIDMat+OpenBLAS was significantly faster than >> >> >>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due >> >> >>>>> to >> >> >>>>> data >> >> >>>>> layout and fewer levels of indirection - it's definitely a >> >> >>>>> worthwhile >> >> >>>>> experiment to run. The main speedups I've seen from using it come >> >> >>>>> from >> >> >>>>> highly optimized GPU code for linear algebra. I know that in the >> >> >>>>> past Canny >> >> >>>>> has gone as far as to write custom GPU kernels for >> >> >>>>> performance-critical >> >> >>>>> regions of code.[1] >> >> >>>>> >> >> >>>>> BIDMach is highly optimized for single node performance or >> >> >>>>> performance on >> >> >>>>> small clusters.[2] Once data doesn't fit easily in GPU memory (or >> >> >>>>> can be >> >> >>>>> batched in that way) the performance tends to fall off. Canny >> >> >>>>> argues >> >> >>>>> for >> >> >>>>> hardware/software codesign and as such prefers machine >> >> >>>>> configurations that >> >> >>>>> are quite different than what we find in most commodity cluster >> >> >>>>> nodes - >> >> >>>>> e.g. 10 disk cahnnels and 4 GPUs. >> >> >>>>> >> >> >>>>> In contrast, MLlib was designed for horizontal scalability on >> >> >>>>> commodity >> >> >>>>> clusters and works best on very big datasets - order of >> >> >>>>> terabytes. >> >> >>>>> >> >> >>>>> For the most part, these projects developed concurrently to >> >> >>>>> address >> >> >>>>> slightly different use cases. 
That said, there may be bits of >> >> >>>>> BIDMach we >> >> >>>>> could repurpose for MLlib - keep in mind we need to be careful >> >> >>>>> about >> >> >>>>> maintaining cross-language compatibility for our Java and >> >> >>>>> Python-users, >> >> >>>>> though. >> >> >>>>> >> >> >>>>> - Evan >> >> >>>>> >> >> >>>>> [1] - http://arxiv.org/abs/1409.5402 >> >> >>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf >> >> >>>>> >> >> >>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander < >> >> >>>>> >> >> >>>>> >> >> >>>>> alexander.ula...@hp.com<mailto:alexander.ula...@hp.com><mailto:alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>><mailto: >> >> >>>>> >> >> >>>>> >> >> >>>>> alexander.ula...@hp.com<mailto:alexander.ula...@hp.com><mailto:alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>>>> >> >> >>>>> wrote: >> >> >>>>> Hi Evan, >> >> >>>>> >> >> >>>>> Thank you for suggestion! BIDMat seems to have terrific speed. Do >> >> >>>>> you >> >> >>>>> know what makes them faster than netlib-java? >> >> >>>>> >> >> >>>>> The same group has BIDMach library that implements machine >> >> >>>>> learning. >> >> >>>>> For >> >> >>>>> some examples they use Caffe convolutional neural network library >> >> >>>>> owned by >> >> >>>>> another group in Berkeley. Could you elaborate on how these all >> >> >>>>> might be >> >> >>>>> connected with Spark Mllib? If you take BIDMat for linear algebra >> >> >>>>> why don’t >> >> >>>>> you take BIDMach for optimization and learning? >> >> >>>>> >> >> >>>>> Best regards, Alexander >> >> >>>>> >> >> >>>>> From: Evan R. Sparks >> >> >>>>> >> >> >>>>> [mailto:evan.spa...@gmail.com<mailto:evan.spa...@gmail.com><mailto: >> >> >>>>> >> >> >>>>> >> >> >>>>> evan.spa...@gmail.com<mailto:evan.spa...@gmail.com>><mailto:evan.spa...@gmail.com<mailto:evan.spa...@gmail.com><mailto: >> >> >>>>> evan.spa...@gmail.com<mailto:evan.spa...@gmail.com>>>] >> >> >>>>> Sent: Thursday, February 05, 2015 12:09 PM >> >> >>>>> To: Ulanov, Alexander >> >> >>>>> Cc: >> >> >>>>> >> >> >>>>> dev@spark.apache.org<mailto:dev@spark.apache.org><mailto:dev@spark.apache.org<mailto:dev@spark.apache.org>><mailto: >> >> >>>>> >> >> >>>>> >> >> >>>>> dev@spark.apache.org<mailto:dev@spark.apache.org><mailto:dev@spark.apache.org<mailto:dev@spark.apache.org>>> >> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra >> >> >>>>> >> >> >>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU >> >> >>>>> blas in >> >> >>>>> many cases. >> >> >>>>> >> >> >>>>> You might consider taking a look at the codepaths that BIDMat ( >> >> >>>>> https://github.com/BIDData/BIDMat) takes and comparing them to >> >> >>>>> netlib-java/breeze. John Canny et. al. have done a bunch of work >> >> >>>>> optimizing >> >> >>>>> to make this work really fast from Scala. I've run it on my >> >> >>>>> laptop >> >> >>>>> and >> >> >>>>> compared to MKL and in certain cases it's 10x faster at matrix >> >> >>>>> multiply. >> >> >>>>> There are a lot of layers of indirection here and you really want >> >> >>>>> to >> >> >>>>> avoid >> >> >>>>> data copying as much as possible. >> >> >>>>> >> >> >>>>> We could also consider swapping out BIDMat for Breeze, but that >> >> >>>>> would be >> >> >>>>> a big project and if we can figure out how to get breeze+cublas >> >> >>>>> to >> >> >>>>> comparable performance that would be a big win. 
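(Aside, for anyone re-running the Breeze/netlib-java side of these comparisons: a rough single-size DGEMM timing sketch follows. It is an illustration under assumptions, not the harness behind the tables quoted in this thread; the object name, default size and warm-up choice are made up.)

  import breeze.linalg.DenseMatrix

  object GemmTiming {
    // Wall-clock seconds for one evaluation of `body`.
    def seconds[A](body: => A): Double = {
      val t0 = System.nanoTime()
      body
      (System.nanoTime() - t0) / 1e9
    }

    def main(args: Array[String]): Unit = {
      val n = if (args.nonEmpty) args(0).toInt else 1000
      val a = DenseMatrix.rand(n, n)   // dense double-precision matrices
      val b = DenseMatrix.rand(n, n)
      seconds(a * b)                   // warm-up: loads the native library, JIT-compiles the wrapper
      println(s"${n}x$n DGEMM through Breeze/netlib-java: " + seconds(a * b) + " s")
    }
  }

Whatever is measured, run it with the same LD_LIBRARY_PATH and -Dcom.github.fommil.netlib.BLAS settings as the configuration it is compared against, otherwise the number says nothing about the backend in question.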
>> >> >>>>> >> >> >>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander < >> >> >>>>> >> >> >>>>> >> >> >>>>> alexander.ula...@hp.com<mailto:alexander.ula...@hp.com><mailto:alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>><mailto: >> >> >>>>> >> >> >>>>> >> >> >>>>> alexander.ula...@hp.com<mailto:alexander.ula...@hp.com><mailto:alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>>>> >> >> >>>>> wrote: >> >> >>>>> Dear Spark developers, >> >> >>>>> >> >> >>>>> I am exploring how to make linear algebra operations faster >> >> >>>>> within >> >> >>>>> Spark. >> >> >>>>> One way of doing this is to use Scala Breeze library that is >> >> >>>>> bundled >> >> >>>>> with >> >> >>>>> Spark. For matrix operations, it employs Netlib-java that has a >> >> >>>>> Java >> >> >>>>> wrapper for BLAS (basic linear algebra subprograms) and LAPACK >> >> >>>>> native >> >> >>>>> binaries if they are available on the worker node. It also has >> >> >>>>> its >> >> >>>>> own >> >> >>>>> optimized Java implementation of BLAS. It is worth mentioning, >> >> >>>>> that >> >> >>>>> native >> >> >>>>> binaries provide better performance only for BLAS level 3, i.e. >> >> >>>>> matrix-matrix operations or general matrix multiplication (GEMM). >> >> >>>>> This is >> >> >>>>> confirmed by GEMM test on Netlib-java page >> >> >>>>> https://github.com/fommil/netlib-java. I also confirmed it with >> >> >>>>> my >> >> >>>>> experiments with training of artificial neural network >> >> >>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952. >> >> >>>>> However, I would like to boost performance more. >> >> >>>>> >> >> >>>>> GPU is supposed to work fast with linear algebra and there is >> >> >>>>> Nvidia >> >> >>>>> CUDA >> >> >>>>> implementation of BLAS, called cublas. I have one Linux server >> >> >>>>> with >> >> >>>>> Nvidia >> >> >>>>> GPU and I was able to do the following. I linked cublas (instead >> >> >>>>> of >> >> >>>>> cpu-based blas) with Netlib-java wrapper and put it into Spark, >> >> >>>>> so >> >> >>>>> Breeze/Netlib is using it. Then I did some performance >> >> >>>>> measurements >> >> >>>>> with >> >> >>>>> regards to artificial neural network batch learning in Spark >> >> >>>>> MLlib >> >> >>>>> that >> >> >>>>> involves matrix-matrix multiplications. It turns out that for >> >> >>>>> matrices of >> >> >>>>> size less than ~1000x780 GPU cublas has the same speed as CPU >> >> >>>>> blas. >> >> >>>>> Cublas >> >> >>>>> becomes slower for bigger matrices. It worth mentioning that it >> >> >>>>> is >> >> >>>>> was not >> >> >>>>> a test for ONLY multiplication since there are other operations >> >> >>>>> involved. >> >> >>>>> One of the reasons for slowdown might be the overhead of copying >> >> >>>>> the >> >> >>>>> matrices from computer memory to graphic card memory and back. >> >> >>>>> >> >> >>>>> So, few questions: >> >> >>>>> 1) Do these results with CUDA make sense? >> >> >>>>> 2) If the problem is with copy overhead, are there any libraries >> >> >>>>> that >> >> >>>>> allow to force intermediate results to stay in graphic card >> >> >>>>> memory >> >> >>>>> thus >> >> >>>>> removing the overhead? >> >> >>>>> 3) Any other options to speed-up linear algebra in Spark? 
>> >> >>>>> Thank you, Alexander >> >> >>>> >> >> > >> >> > -- >> >> > Best regards, >> >> > Sam >> >> > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org