Hi Xiangrui,

Thanks for the link; I am currently trying to use NVBLAS. It seems that the 
netlib wrappers are implemented against the CBLAS interface, and NVBLAS does 
not provide CBLAS entry points. I wonder how that is going to work. I'll keep 
you updated.

Alexander

-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Monday, March 02, 2015 11:42 AM
To: Sam Halliday
Cc: Joseph Bradley; Ulanov, Alexander; dev; Evan R. Sparks
Subject: Re: Using CUDA within Spark / boosting linear algebra

On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday <sam.halli...@gmail.com> wrote:
> Also, check the JNILoader output.
>
> Remember, for netlib-java to use your system libblas all you need to 
> do is set up libblas.so.3 the way any native application would expect.
>
> I haven't ever used the cublas "real BLAS" implementation, so I'd be 
> interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to 
> check that all the runtime links are in order.
>

There are two shared libraries in this hybrid setup: nvblas.so must be loaded 
before libblas.so so that it can intercept the Level 3 routines and run them 
on the GPU. More details are at: 
http://docs.nvidia.com/cuda/nvblas/index.html#Usage
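
For reference, a minimal sketch of how one could verify the result from the 
JVM side (the preload path and config file name below are illustrative 
assumptions, not a tested recipe):

  // shell, before launching the JVM:
  //   export NVBLAS_CONFIG_FILE=/path/to/nvblas.conf  (names the CPU BLAS fallback)
  //   export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so
  import com.github.fommil.netlib.BLAS
  // prints the wrapper that JNILoader bound, e.g. NativeSystemBLAS when the
  // native path worked; if Level 3 calls then show up in nvidia-smi while
  // running, NVBLAS is intercepting them
  println(BLAS.getInstance().getClass.getName)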

> Btw, I have some DGEMM wrappers in my netlib-java performance 
> module... and I also planned to write more in MultiBLAS (until I 
> mothballed the project to wait for the hardware to catch up, which it 
> probably has by now, so I just need a reason to look at it again)
>
> On 27 Feb 2015 20:26, "Xiangrui Meng" <men...@gmail.com> wrote:
>>
>> Hey Sam,
>>
>> The running times are not "big O" estimates:
>>
>> > The CPU version finished in 12 seconds.
>> > The CPU->GPU->CPU version finished in 2.2 seconds.
>> > The GPU version finished in 1.7 seconds.
>>
>> I think there is something wrong with the netlib/cublas combination.
>> Sam already mentioned that cuBLAS doesn't implement the CPU BLAS 
>> interfaces. I checked the CUDA doc and it seems that to use GPU BLAS 
>> through the CPU BLAS interface we need to use NVBLAS, which 
>> intercepts some Level 3 CPU BLAS calls (including GEMM). So we need 
>> to load nvblas.so first and then some CPU BLAS library in JNI. I 
>> wonder whether the setup was correct.
>>
>> Alexander, could you check whether the GPU is actually used in the 
>> netlib-cublas experiments? You can tell by watching CPU/GPU usage 
>> (e.g. top for the CPU and nvidia-smi for the GPU).
>>
>> Best,
>> Xiangrui
>>
>> On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday 
>> <sam.halli...@gmail.com>
>> wrote:
>> > Don't use "big O" estimates, always measure. It used to work back 
>> > in the days when double multiplication was a bottleneck. The 
>> > computation cost is effectively free on both the CPU and GPU and 
>> > you're seeing pure copying costs. Also, I'm dubious that cublas is 
>> > doing what you think it is. Can you link me to the source code for 
>> > DGEMM?
>> >
>> > I show all of this in my talk, with explanations. I can't stress 
>> > enough how much I recommend that you watch it if you want to 
>> > understand high-performance hardware acceleration for linear 
>> > algebra :-)
>> >
>> > On 27 Feb 2015 01:42, "Xiangrui Meng" <men...@gmail.com> wrote:
>> >>
>> >> The copying overhead should be quadratic in n, while the 
>> >> computation cost is cubic in n. I can understand that 
>> >> netlib-cublas is slower than netlib-openblas on small problems. 
>> >> But I'm surprised to see that it is still 20x slower at 
>> >> 10000x10000. I did the following on a g2.2xlarge instance with BIDMat:
>> >>
>> >> val n = 10000
>> >>
>> >> val f = rand(n, n)                  // n x n FMat in host memory
>> >> flip; f*f; val rf = flop            // CPU multiply; flip resets BIDMat's
>> >>                                     // flop counter/timer, flop reads it back
>> >>
>> >> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null)
>> >> val rg = flop                       // copy-in + GPU multiply + copy-out
>> >>
>> >> flip; g*g; val rgg = flop           // GPU multiply alone, no copies
>> >>
>> >> The CPU version finished in 12 seconds.
>> >> The CPU->GPU->CPU version finished in 2.2 seconds.
>> >> The GPU version finished in 1.7 seconds.
>> >>
>> >> I'm not sure whether my CPU->GPU->CPU code simulates the 
>> >> netlib-cublas path. But based on the result, the data copying 
>> >> overhead is definitely not as big as 20x at n = 10000.
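>> >>
>> >> As a rough sanity check (assuming ~6 GB/s of effective PCIe 
>> >> bandwidth, which is my assumption, not a measurement): copying three 
>> >> 10000x10000 double matrices is 3 * 10^8 * 8 bytes ~= 2.4 GB, i.e. 
>> >> about 0.4 s of transfer, against 2n^3 = 2e12 floating-point 
>> >> operations for the multiply. So copying should cost a fraction of a 
>> >> second here, nowhere near a 20x factor.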
>> >>
>> >> Best,
>> >> Xiangrui
>> >>
>> >>
>> >> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday 
>> >> <sam.halli...@gmail.com>
>> >> wrote:
>> >> > I've had some email exchanges with the author of BIDMat: it does 
>> >> > exactly what you need to get the GPU benefit, writing higher-level 
>> >> > algorithms entirely in GPU kernels so that the memory stays on the 
>> >> > device as long as possible. The restriction of this approach is 
>> >> > that it only offers high-level algorithms, so it is not a toolkit 
>> >> > for applied mathematics research and development, but it works 
>> >> > well as a toolkit for higher-level analysis (e.g. for analysts and 
>> >> > practitioners).
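>> >> >
>> >> > A sketch of the idea, reusing only the BIDMat calls from the 
>> >> > benchmark quoted above (illustrative, not tested):
>> >> >
>> >> >   val n = 10000
>> >> >   val f = rand(n, n)                 // host-side FMat
>> >> >   val g = GMat(n, n); g.copyFrom(f)  // one host-to-device copy
>> >> >   val h = g * g                      // result stays in GPU memory
>> >> >   val out = (h * g).toFMat(null)     // single device-to-host copy at the end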
>> >> >
>> >> > I believe BIDMat's approach is the best way to get performance 
>> >> > out of GPU hardware at the moment, but I also have strong 
>> >> > evidence to suggest that the hardware will catch up and the 
>> >> > memory transfer costs between CPU/GPU will disappear, meaning 
>> >> > that there will be no need for custom GPU kernel 
>> >> > implementations. I.e., please continue to use BLAS primitives 
>> >> > when writing new algorithms and only go to the GPU for an 
>> >> > alternative optimised implementation.
>> >> >
>> >> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, 
>> >> > offering an API that looks like BLAS but takes pointers to 
>> >> > special regions of GPU memory. Somebody has written a wrapper 
>> >> > around CUDA to create a proper BLAS library, but it only gives 
>> >> > marginal performance over the CPU because of the memory transfer 
>> >> > overhead.
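>> >> >
>> >> > To make the distinction concrete, here is roughly the call pattern 
>> >> > through the JCuda JCublas wrapper (method names from its v1 API, 
>> >> > quoted from memory, so treat this as a sketch, not a tested program):
>> >> >
>> >> >   import jcuda.{Pointer, Sizeof}
>> >> >   import jcuda.jcublas.JCublas
>> >> >
>> >> >   val n = 1000
>> >> >   val hA, hB, hC = new Array[Double](n * n)  // host matrices
>> >> >   val dA, dB, dC = new Pointer               // handles into GPU memory
>> >> >   JCublas.cublasInit()
>> >> >   JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dA)
>> >> >   JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dB)
>> >> >   JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dC)
>> >> >   // these copies are exactly the per-call overhead a "proper BLAS"
>> >> >   // wrapper around CUDA has to pay:
>> >> >   JCublas.cublasSetVector(n * n, Sizeof.DOUBLE, Pointer.to(hA), 1, dA, 1)
>> >> >   JCublas.cublasSetVector(n * n, Sizeof.DOUBLE, Pointer.to(hB), 1, dB, 1)
>> >> >   JCublas.cublasDgemm('n', 'n', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n)
>> >> >   JCublas.cublasGetVector(n * n, Sizeof.DOUBLE, dC, 1, Pointer.to(hC), 1)
>> >> >   JCublas.cublasFree(dA); JCublas.cublasFree(dB); JCublas.cublasFree(dC)
>> >> >   JCublas.cublasShutdown()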
>> >> >
>> >> > This slide from my talk
>> >> >
>> >> >   http://fommil.github.io/scalax14/#/11/2
>> >> >
>> >> > says it all. The X axis is matrix size, the Y axis is (log-scale) 
>> >> > time to do a DGEMM. The black line is the "cheating" time for the 
>> >> > GPU, and the green line is the time after copying the memory 
>> >> > to/from GPU memory. APUs have the potential to eliminate the 
>> >> > green line.
>> >> >
>> >> > Best regards,
>> >> > Sam
>> >> >
>> >> >
>> >> >
>> >> > "Ulanov, Alexander" <alexander.ula...@hp.com> writes:
>> >> >
>> >> >> Evan, thank you for the summary. I would like to add some more 
>> >> >> observations. The GPU that I used is 2.5 times more expensive 
>> >> >> than the CPU ($250 vs $100). They both are 3 years old. I also 
>> >> >> did a small test with modern hardware, and the new Nvidia Titan 
>> >> >> GPU was slightly more than 1 order of magnitude faster than an 
>> >> >> Intel E5-2650 v2 for the same tests. However, it costs as much 
>> >> >> as the CPU ($1200). My takeaway is that GPUs are making better 
>> >> >> price/performance progress.
>> >> >>
>> >> >>
>> >> >>
>> >> >> Xiangrui, I was also surprised that BIDMat-cuda was faster than 
>> >> >> netlib-cuda, and the most reasonable explanation is that it 
>> >> >> holds the result in GPU memory, as Sam suggested. At the same 
>> >> >> time, that is OK, because you can copy the result back from the 
>> >> >> GPU only when needed. However, to be sure, I am going to ask the 
>> >> >> developer of BIDMat at his upcoming talk.
>> >> >>
>> >> >>
>> >> >>
>> >> >> Best regards, Alexander
>> >> >>
>> >> >>
>> >> >> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>> >> >> Sent: Thursday, February 26, 2015 1:56 PM
>> >> >> To: Xiangrui Meng
>> >> >> Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
>> >> >> Sparks
>> >> >> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>
>> >> >>
>> >> >> Btw, I wish people would stop cheating when comparing CPU and 
>> >> >> GPU timings for things like matrix multiply :-P
>> >> >>
>> >> >> Please always compare apples with apples and include the time 
>> >> >> it takes to set up the matrices, send them to the processing 
>> >> >> unit, do the calculation AND copy the result back to where you 
>> >> >> need to see it.
>> >> >>
>> >> >> Ignoring this will make you believe that your GPU is 
>> >> >> thousands of times faster than it really is. Again, jump to the 
>> >> >> end of my talk for graphs and more discussion.... especially 
>> >> >> the bit about me being keen on funding to investigate APU 
>> >> >> hardware further ;-) (I believe it will solve the problem)
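>> >> >>
>> >> >> A minimal sketch of what I mean by a fair measurement, using the 
>> >> >> netlib-java API (single run, no warm-up, so purely illustrative):
>> >> >>
>> >> >>   import com.github.fommil.netlib.BLAS
>> >> >>
>> >> >>   val n = 2000
>> >> >>   // three column-major n x n matrices as flat arrays
>> >> >>   val a, b, c = Array.fill(n * n)(scala.util.Random.nextDouble())
>> >> >>   val t0 = System.nanoTime
>> >> >>   // for a GPU-backed BLAS this one call must pay for the copy to
>> >> >>   // the device, the compute, AND the copy back; the honest number
>> >> >>   BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
>> >> >>   println(s"end-to-end DGEMM: ${(System.nanoTime - t0) / 1e9} s")
>> >> >>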
>> >> >> On 26 Feb 2015 21:16, "Xiangrui Meng" <men...@gmail.com> wrote:
>> >> >> Hey Alexander,
>> >> >>
>> >> >> I don't quite understand the part where netlib-cublas is about 
>> >> >> 20x slower than netlib-openblas. What is the overhead of using 
>> >> >> a GPU BLAS with netlib-java?
>> >> >>
>> >> >> CC'ed Sam, the author of netlib-java.
>> >> >>
>> >> >> Best,
>> >> >> Xiangrui
>> >> >>
>> >> >> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley 
>> >> >> <jos...@databricks.com> wrote:
>> >> >>> Better documentation for linking would be very helpful!  
>> >> >>> Here's a
>> >> >>> JIRA:
>> >> >>> https://issues.apache.org/jira/browse/SPARK-6019
>> >> >>>
>> >> >>>
>> >> >>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks 
>> >> >>> <evan.spa...@gmail.com> wrote:
>> >> >>>
>> >> >>>> Thanks for compiling all the data and running these 
>> >> >>>> benchmarks, Alex. The big takeaways here can be seen with this 
>> >> >>>> chart:
>> >> >>>>
>> >> >>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>> >> >>>>
>> >> >>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>> >> >>>> BIDMat+GPU) can provide substantial (but less than an order of
>> >> >>>> magnitude) benefit over a well-tuned CPU implementation (e.g.
>> >> >>>> BIDMat+MKL or netlib-java+openblas-compiled).
>> >> >>>> 2) A poorly tuned CPU implementation can be 1-2 orders of 
>> >> >>>> magnitude worse than a well-tuned CPU implementation, 
>> >> >>>> particularly for larger matrices (netlib-f2jblas or 
>> >> >>>> netlib-ref). This is not to pick on netlib - it basically 
>> >> >>>> agrees with the author's own benchmarks (
>> >> >>>> https://github.com/fommil/netlib-java)
>> >> >>>>
>> >> >>>> I think that most of our users are in a situation where using 
>> >> >>>> GPUs may not be practical - although we could consider having 
>> >> >>>> a good GPU backend available as an option. However, *ALL* 
>> >> >>>> users of MLlib could benefit (potentially tremendously) from 
>> >> >>>> using a well-tuned CPU-based BLAS implementation. Perhaps we 
>> >> >>>> should consider updating the mllib guide with a more complete 
>> >> >>>> section for enabling high performance binaries on OSX and 
>> >> >>>> Linux? Or better, figure out a way for the system to fetch 
>> >> >>>> these automatically.
>> >> >>>>
>> >> >>>> - Evan
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
>> >> >>>> <alexander.ula...@hp.com> wrote:
>> >> >>>>
>> >> >>>>> Just to summarize this thread, I was finally able to make all 
>> >> >>>>> the performance comparisons that we discussed. It turns out that:
>> >> >>>>>
>> >> >>>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == 
>> >> >>>>> netlib-openblas-compiled > netlib-openblas-yum-repo == 
>> >> >>>>> netlib-cublas > netlib-blas > f2jblas
>> >> >>>>>
>> >> >>>>> Below is the link to the spreadsheet with full results.
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>> >> >>>>>
>> >> >>>>> One thing still needs exploration: does BIDMat-cublas 
>> >> >>>>> perform copying to/from machine’s RAM?
>> >> >>>>>
>> >> >>>>> -----Original Message-----
>> >> >>>>> From: Ulanov, Alexander
>> >> >>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>> >> >>>>> To: Evan R. Sparks
>> >> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
>> >> >>>>> Subject: RE: Using CUDA within Spark / boosting linear 
>> >> >>>>> algebra
>> >> >>>>>
>> >> >>>>> Thanks, Evan! It seems that the ticket was marked as a 
>> >> >>>>> duplicate, though the original one discusses a slightly 
>> >> >>>>> different topic. I was able to link netlib with the MKL from 
>> >> >>>>> the BIDMat binaries. Indeed, MKL is statically linked inside 
>> >> >>>>> a 60MB library.
>> >> >>>>>
>> >> >>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>> >> >>>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>> >> >>>>> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>> >> >>>>> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>> >> >>>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>> >> >>>>>
>> >> >>>>> It turns out that the precompiled MKL is faster than the 
>> >> >>>>> precompiled OpenBLAS on my machine. I'll probably add two more 
>> >> >>>>> columns, with locally compiled OpenBLAS and CUDA.
>> >> >>>>>
>> >> >>>>> Alexander
>> >> >>>>>
>> >> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> >> >>>>> Sent: Monday, February 09, 2015 6:06 PM
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> Great - perhaps we can move this discussion off-list and onto a
>> >> >>>>> JIRA
>> >> >>>>> ticket? (Here's one:
>> >> >>>>> https://issues.apache.org/jira/browse/SPARK-5705)
>> >> >>>>>
>> >> >>>>> It seems like this is going to be somewhat exploratory for a
>> >> >>>>> while
>> >> >>>>> (and
>> >> >>>>> there's probably only a handful of us who really care about fast
>> >> >>>>> linear
>> >> >>>>> algebra!)
>> >> >>>>>
>> >> >>>>> - Evan
>> >> >>>>>
>> >> >>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander 
>> >> >>>>> <alexander.ula...@hp.com> wrote:
>> >> >>>>> Hi Evan,
>> >> >>>>>
>> >> >>>>> Thank you for the explanation and the useful link. I am going 
>> >> >>>>> to build OpenBLAS, link it with Netlib-java, and run the 
>> >> >>>>> benchmark again.
>> >> >>>>>
>> >> >>>>> Do I understand correctly that the BIDMat binaries contain
>> >> >>>>> statically linked Intel MKL BLAS? That might be the reason why
>> >> >>>>> I am able to run BIDMat without having MKL BLAS installed on my
>> >> >>>>> server. If that is true, I wonder whether it is OK, because
>> >> >>>>> Intel sells this library. Nevertheless, it seems that in my
>> >> >>>>> case the precompiled MKL BLAS performs better than the
>> >> >>>>> precompiled OpenBLAS, given that BIDMat and Netlib-java are
>> >> >>>>> supposed to be on par in JNI overheads.
>> >> >>>>>
>> >> >>>>> Though, it might be interesting to link Netlib-java with Intel
>> >> >>>>> MKL, as you suggested. I wonder whether John Canny (BIDMat) and
>> >> >>>>> Sam Halliday (Netlib-java) would be interested in comparing
>> >> >>>>> their libraries.
>> >> >>>>>
>> >> >>>>> Best regards, Alexander
>> >> >>>>>
>> >> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> >> >>>>> Sent: Friday, February 06, 2015 5:58 PM
>> >> >>>>>
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> I would build OpenBLAS yourself, since good BLAS performance
>> >> >>>>> comes
>> >> >>>>> from
>> >> >>>>> getting cache sizes, etc. set up correctly for your particular
>> >> >>>>> hardware -
>> >> >>>>> this is often a very tricky process (see, e.g. ATLAS), but we
>> >> >>>>> found
>> >> >>>>> that on
>> >> >>>>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>> >> >>>>> performance competitive with MKL.
>> >> >>>>>
>> >> >>>>> To make sure the right library is getting used, you have to
>> >> >>>>> make sure it's first on the search path - export
>> >> >>>>> LD_LIBRARY_PATH=/path/to/dir/containing/your/blas will do the
>> >> >>>>> trick here (the variable takes a directory, not the .so itself).
>> >> >>>>>
>> >> >>>>> For some examples of getting netlib-java setup on an ec2 node and
>> >> >>>>> some
>> >> >>>>> example benchmarking code we ran a while back, see:
>> >> >>>>> https://github.com/shivaram/matrix-bench
>> >> >>>>>
>> >> >>>>> In particular - build-openblas-ec2.sh shows you how to build the
>> >> >>>>> library
>> >> >>>>> and set up symlinks correctly, and scala/run-netlib.sh shows you
>> >> >>>>> how
>> >> >>>>> to get
>> >> >>>>> the path setup and get that library picked up by netlib-java.
>> >> >>>>>
>> >> >>>>> In this way - you could probably get cuBLAS set up to be used by
>> >> >>>>> netlib-java as well.
>> >> >>>>>
>> >> >>>>> - Evan
>> >> >>>>>
>> >> >>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander 
>> >> >>>>> <alexander.ula...@hp.com> wrote:
>> >> >>>>> Evan, could you elaborate on how to force BIDMat and
>> >> >>>>> netlib-java to load the right BLAS? For netlib, there are a few
>> >> >>>>> JVM flags, such as
>> >> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>> >> >>>>> so I can force it to use the Java implementation (the full set
>> >> >>>>> of wrapper flags is sketched below). I am not sure I understand
>> >> >>>>> how to force the use of a specific BLAS (not a specific wrapper
>> >> >>>>> for BLAS).
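>> >> >>>>>
>> >> >>>>> For reference, the wrapper-level flags I mean (implementation
>> >> >>>>> class names as documented on the netlib-java page):
>> >> >>>>>
>> >> >>>>>   -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS
>> >> >>>>>       (pure-Java fallback)
>> >> >>>>>   -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeRefBLAS
>> >> >>>>>       (bundled reference natives)
>> >> >>>>>   -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS
>> >> >>>>>       (system libblas.so.3)
>> >> >>>>>
>> >> >>>>> As far as I understand, these select only the wrapper; which
>> >> >>>>> actual library NativeSystemBLAS binds to is decided at the OS
>> >> >>>>> level (the libblas.so.3 symlink / LD_LIBRARY_PATH), not by the
>> >> >>>>> JVM flag.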
>> >> >>>>>
>> >> >>>>> Btw, I have installed OpenBLAS (yum install openblas), so I
>> >> >>>>> suppose that netlib is using it.
>> >> >>>>>
>> >> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> >> >>>>> Sent: Friday, February 06, 2015 5:19 PM
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
>> >> >>>>>
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> Getting breeze to pick up the right blas library is critical for
>> >> >>>>> performance. I recommend using OpenBLAS (or MKL, if you already
>> >> >>>>> have
>> >> >>>>> it).
>> >> >>>>> It might make sense to force BIDMat to use the same underlying
>> >> >>>>> BLAS
>> >> >>>>> library
>> >> >>>>> as well.
>> >> >>>>>
>> >> >>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander 
>> >> >>>>> <alexander.ula...@hp.com> wrote:
>> >> >>>>> Hi Evan, Joseph
>> >> >>>>>
>> >> >>>>> I did a few matrix multiplication tests, and BIDMat seems to be
>> >> >>>>> ~10x faster than netlib-java+breeze:
>> >> >>>>>
>> >> >>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>> >> >>>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
>> >> >>>>> |100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>> >> >>>>> |1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>> >> >>>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>> >> >>>>>
>> >> >>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM,
>> >> >>>>> Fedora
>> >> >>>>> 19
>> >> >>>>> Linux, Scala 2.11.
>> >> >>>>>
>> >> >>>>> Later I will run tests with CUDA. I need to install a new CUDA
>> >> >>>>> version for this purpose.
>> >> >>>>>
>> >> >>>>> Do you have any ideas why breeze-netlib with native blas is so
>> >> >>>>> much
>> >> >>>>> slower than BIDMat MKL?
>> >> >>>>>
>> >> >>>>> Best regards, Alexander
>> >> >>>>>
>> >> >>>>> From: Joseph Bradley [mailto:jos...@databricks.com]
>> >> >>>>> Sent: Thursday, February 05, 2015 5:29 PM
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: Evan R. Sparks; dev@spark.apache.org
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> Hi Alexander,
>> >> >>>>>
>> >> >>>>> Using GPUs with Spark would be very exciting.  Small comment:
>> >> >>>>> Concerning
>> >> >>>>> your question earlier about keeping data stored on the GPU rather
>> >> >>>>> than
>> >> >>>>> having to move it between main memory and GPU memory on each
>> >> >>>>> iteration, I
>> >> >>>>> would guess this would be critical to getting good performance.
>> >> >>>>> If
>> >> >>>>> you
>> >> >>>>> could do multiple local iterations before aggregating results,
>> >> >>>>> then
>> >> >>>>> the
>> >> >>>>> cost of data movement to the GPU could be amortized (and I
>> >> >>>>> believe
>> >> >>>>> that is
>> >> >>>>> done in practice).  Having Spark be aware of the GPU and using it
>> >> >>>>> as
>> >> >>>>> another part of memory sounds like a much bigger undertaking.
>> >> >>>>>
>> >> >>>>> Joseph
>> >> >>>>>
>> >> >>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander 
>> >> >>>>> <alexander.ula...@hp.com> wrote:
>> >> >>>>> Thank you for the explanation! I've watched the BIDMach
>> >> >>>>> presentation by John Canny, and I am really inspired by his
>> >> >>>>> talk and his comparisons with Spark MLlib.
>> >> >>>>>
>> >> >>>>> I am very interested in finding out which will be better within
>> >> >>>>> Spark: BIDMat, or netlib-java with CPU or GPU natives. Could you
>> >> >>>>> suggest a fair way to benchmark them? Currently I run benchmarks
>> >> >>>>> on artificial neural networks in batch mode. While that is not a
>> >> >>>>> “pure” test of linear algebra, it involves some other things
>> >> >>>>> that are essential to machine learning.
>> >> >>>>>
>> >> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> >> >>>>> Sent: Thursday, February 05, 2015 1:29 PM
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: dev@spark.apache.org
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster
>> >> >>>>> than netlib-java+OpenBLAS, but if it is much faster it's
>> >> >>>>> probably due to data layout and fewer levels of indirection -
>> >> >>>>> it's definitely a worthwhile experiment to run. The main
>> >> >>>>> speedups I've seen from it come from highly optimized GPU code
>> >> >>>>> for linear algebra. I know that in the past Canny has gone as
>> >> >>>>> far as writing custom GPU kernels for performance-critical
>> >> >>>>> regions of code.[1]
>> >> >>>>>
>> >> >>>>> BIDMach is highly optimized for single-node performance or
>> >> >>>>> performance on small clusters.[2] Once data doesn't fit easily
>> >> >>>>> in GPU memory (or cannot be batched that way), the performance
>> >> >>>>> tends to fall off. Canny argues for hardware/software codesign
>> >> >>>>> and as such prefers machine configurations that are quite
>> >> >>>>> different from what we find in most commodity cluster nodes -
>> >> >>>>> e.g. 10 disk channels and 4 GPUs.
>> >> >>>>>
>> >> >>>>> In contrast, MLlib was designed for horizontal scalability on
>> >> >>>>> commodity clusters and works best on very big datasets - on the
>> >> >>>>> order of terabytes.
>> >> >>>>>
>> >> >>>>> For the most part, these projects developed concurrently to
>> >> >>>>> address slightly different use cases. That said, there may be
>> >> >>>>> bits of BIDMach we could repurpose for MLlib - keep in mind we
>> >> >>>>> need to be careful about maintaining cross-language
>> >> >>>>> compatibility for our Java and Python users, though.
>> >> >>>>>
>> >> >>>>> - Evan
>> >> >>>>>
>> >> >>>>> [1] - http://arxiv.org/abs/1409.5402
>> >> >>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>> >> >>>>>
>> >> >>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander 
>> >> >>>>> <alexander.ula...@hp.com> wrote:
>> >> >>>>> Hi Evan,
>> >> >>>>>
>> >> >>>>> Thank you for the suggestion! BIDMat seems to have terrific
>> >> >>>>> speed. Do you know what makes it faster than netlib-java?
>> >> >>>>>
>> >> >>>>> The same group has the BIDMach library, which implements
>> >> >>>>> machine learning. For some examples they use the Caffe
>> >> >>>>> convolutional neural network library, owned by another group at
>> >> >>>>> Berkeley. Could you elaborate on how all of these might be
>> >> >>>>> connected with Spark MLlib? If you take BIDMat for linear
>> >> >>>>> algebra, why not take BIDMach for optimization and learning?
>> >> >>>>>
>> >> >>>>> Best regards, Alexander
>> >> >>>>>
>> >> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>> >> >>>>> Sent: Thursday, February 05, 2015 12:09 PM
>> >> >>>>> To: Ulanov, Alexander
>> >> >>>>> Cc: dev@spark.apache.org
>> >> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >> >>>>>
>> >> >>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>> >> >>>>> blas in
>> >> >>>>> many cases.
>> >> >>>>>
>> >> >>>>> You might consider taking a look at the codepaths that BIDMat (
>> >> >>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>> >> >>>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>> >> >>>>> optimizing to make this work really fast from Scala. I've run
>> >> >>>>> it on my laptop and compared it to MKL, and in certain cases
>> >> >>>>> it's 10x faster at matrix multiply. There are a lot of layers
>> >> >>>>> of indirection here, and you really want to avoid data copying
>> >> >>>>> as much as possible.
>> >> >>>>>
>> >> >>>>> We could also consider swapping Breeze out for BIDMat, but that
>> >> >>>>> would be a big project, and if we can figure out how to get
>> >> >>>>> breeze+cublas to comparable performance, that would be a big
>> >> >>>>> win.
>> >> >>>>>
>> >> >>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander 
>> >> >>>>> <alexander.ula...@hp.com> wrote:
>> >> >>>>> Dear Spark developers,
>> >> >>>>>
>> >> >>>>> I am exploring how to make linear algebra operations faster
>> >> >>>>> within Spark. One way of doing this is to use the Scala Breeze
>> >> >>>>> library that is bundled with Spark. For matrix operations, it
>> >> >>>>> employs Netlib-java, which has a Java wrapper for BLAS (basic
>> >> >>>>> linear algebra subprograms) and LAPACK native binaries, if they
>> >> >>>>> are available on the worker node. It also has its own optimized
>> >> >>>>> Java implementation of BLAS. It is worth mentioning that native
>> >> >>>>> binaries provide better performance only for BLAS Level 3, i.e.
>> >> >>>>> matrix-matrix operations or general matrix multiplication
>> >> >>>>> (GEMM). This is confirmed by the GEMM test on the Netlib-java
>> >> >>>>> page https://github.com/fommil/netlib-java. I also confirmed it
>> >> >>>>> with my experiments with training of artificial neural networks:
>> >> >>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>> >> >>>>> However, I would like to boost performance more.
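>> >> >>>>>
>> >> >>>>> To make the call path concrete, this is roughly the Level 3
>> >> >>>>> call that Breeze ends up making through Netlib-java (sizes
>> >> >>>>> illustrative):
>> >> >>>>>
>> >> >>>>>   import com.github.fommil.netlib.BLAS
>> >> >>>>>
>> >> >>>>>   // C := alpha*A*B + beta*C on column-major double[] storage;
>> >> >>>>>   // this GEMM is the routine native backends accelerate most
>> >> >>>>>   val n = 1000
>> >> >>>>>   val a, b, c = new Array[Double](n * n)
>> >> >>>>>   BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)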
>> >> >>>>>
>> >> >>>>> GPU is supposed to work fast with linear algebra, and there is
>> >> >>>>> an Nvidia CUDA implementation of BLAS, called cublas. I have
>> >> >>>>> one Linux server with an Nvidia GPU, and I was able to do the
>> >> >>>>> following. I linked cublas (instead of a CPU-based BLAS) with
>> >> >>>>> the Netlib-java wrapper and put it into Spark, so Breeze/Netlib
>> >> >>>>> is using it. Then I did some performance measurements with
>> >> >>>>> regard to artificial neural network batch learning in Spark
>> >> >>>>> MLlib, which involves matrix-matrix multiplications. It turns
>> >> >>>>> out that for matrices of size less than ~1000x780, GPU cublas
>> >> >>>>> has the same speed as CPU blas. Cublas becomes slower for
>> >> >>>>> bigger matrices. It is worth mentioning that it was not a test
>> >> >>>>> of ONLY multiplication, since there are other operations
>> >> >>>>> involved. One of the reasons for the slowdown might be the
>> >> >>>>> overhead of copying the matrices from main memory to graphics
>> >> >>>>> card memory and back.
>> >> >>>>>
>> >> >>>>> So, a few questions:
>> >> >>>>> 1) Do these results with CUDA make sense?
>> >> >>>>> 2) If the problem is the copy overhead, are there any libraries
>> >> >>>>> that allow intermediate results to stay in graphics card
>> >> >>>>> memory, thus removing the overhead?
>> >> >>>>> 3) Any other options to speed up linear algebra in Spark?
>> >> >>>>>
>> >> >>>>> Thank you, Alexander
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> ---------------------------------------------------------------------
>> >> >>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> >>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>
>> >> >
>> >> > --
>> >> > Best regards,
>> >> > Sam
>> >> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
