Hey Sam,

These running times are measured, not "big O" estimates:

> The CPU version finished in 12 seconds.
> The CPU->GPU->CPU version finished in 2.2 seconds.
> The GPU version finished in 1.7 seconds.

I think there is something wrong with the netlib/cublas combination.
Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
interfaces. I checked the CUDA docs, and it seems that to use GPU BLAS
through the CPU BLAS interface we need to use NVBLAS, which intercepts
some Level 3 CPU BLAS calls (including GEMM). So in JNI we need to load
nvblas.so first and then a CPU BLAS library. I wonder
whether the setup was correct.
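
For reference, here is roughly the setup I would expect (the paths and the
CPU BLAS choice below are just placeholders for whatever is on the box):

    # nvblas.conf, found via the NVBLAS_CONFIG_FILE env var (default ./nvblas.conf)
    NVBLAS_CPU_BLAS_LIB  /usr/lib64/libopenblas.so
    NVBLAS_GPU_LIST      ALL
    NVBLAS_LOGFILE       nvblas.log

    # preload NVBLAS so it can intercept Level 3 calls before the CPU BLAS
    export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so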

Alexander, could you check whether the GPU is actually used in the
netlib-cublas experiments? You can tell by watching CPU/GPU utilization
(e.g. with nvidia-smi) while they run.

Best,
Xiangrui

On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday <sam.halli...@gmail.com> wrote:
> Don't use "big O" estimates, always measure. That used to work back in the
> days when double multiplication was the bottleneck. Now the computation cost is
> effectively free on both the CPU and the GPU, and you're seeing pure copying
> costs. Also, I'm dubious that cublas is doing what you think it is. Can you
> link me to the source code for DGEMM?
>
> I show all of this in my talk, with explanations. I can't stress enough how
> much I recommend watching it if you want to understand high-performance
> hardware acceleration for linear algebra :-)
>
> On 27 Feb 2015 01:42, "Xiangrui Meng" <men...@gmail.com> wrote:
>>
>> The copying overhead should be quadratic in n, while the computation
>> cost is cubic in n. I can understand that netlib-cublas is slower than
>> netlib-openblas on small problems. But I'm surprised to see that it is
>> still 20x slower on 10000x10000. I did the following on a g2.2xlarge
>> instance with BIDMat:
>>
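>> // (my rough reading of BIDMat's timing helpers: flip resets the timers and
>> // counters, flop reads back what was measured since the last flip)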
>> val n = 10000
>>
>> val f = rand(n, n)
>> flip; f*f; val rf = flop
>>
>> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
>>
>> flip; g*g; val rgg = flop
>>
>> The CPU version finished in 12 seconds.
>> The CPU->GPU->CPU version finished in 2.2 seconds.
>> The GPU version finished in 1.7 seconds.
>>
>> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
>> path. But based on the result, the data copying overhead is definitely
>> not as big as 20x at n = 10000.
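>>
>> As a rough back-of-the-envelope (assuming double precision, and counting
>> only copying A and B to the GPU and the product back):
>>
>> val n = 10000.0
>> val gemmFlops = 2 * n * n * n   // ~2e12 floating-point ops for one GEMM
>> val copyBytes = 3 * n * n * 8   // ~2.4e9 bytes moved across PCIe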
>>
>> Best,
>> Xiangrui
>>
>>
>> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday <sam.halli...@gmail.com>
>> wrote:
>> > I've had some email exchanges with the author of BIDMat: it does exactly
>> > what you need to do to get the GPU benefit, implementing higher-level
>> > algorithms entirely in GPU kernels so that the data stays in GPU memory
>> > as long as possible. The restriction of this approach is that it only
>> > offers high-level algorithms, so it is not a toolkit for applied
>> > mathematics research and development --- but it works well as a toolkit
>> > for higher level analysis (e.g. for analysts and practitioners).
>> >
>> > I believe BIDMat's approach is the best way to get performance out of
>> > GPU hardware at the moment, but I also have strong evidence to suggest
>> > that the hardware will catch up and the memory transfer costs between
>> > CPU/GPU will disappear, meaning that there will be no need for custom GPU
>> > kernel implementations. I.e. please continue to use BLAS primitives when
>> > writing new algorithms and only go to the GPU for an alternative
>> > optimised implementation.
>> >
>> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
>> > an API that looks like BLAS but takes pointers to special regions of GPU
>> > memory. Somebody has written a wrapper around CUDA to create a proper
>> > BLAS library, but it only gives marginal gains over the CPU because of
>> > the memory transfer overhead.
>> >
>> > This slide from my talk
>> >
>> >   http://fommil.github.io/scalax14/#/11/2
>> >
>> > says it all. The X axis is matrix size, the Y axis is (logarithmic) time
>> > to do DGEMM. The black line is the "cheating" time for the GPU and the
>> > green line is the time after copying the memory to/from the GPU. APUs
>> > have the potential to eliminate the green line.
>> >
>> > Best regards,
>> > Sam
>> >
>> >
>> >
>> > "Ulanov, Alexander" <alexander.ula...@hp.com> writes:
>> >
>> >> Evan, thank you for the summary. I would like to add some more
>> >> observations. The GPU that I used is 2.5 times cheaper than the CPU
>> >> ($100 for the GPU vs $250 for the CPU). They are both 3 years old. I also
>> >> did a small test with modern hardware, and the new Nvidia Titan GPU was
>> >> slightly more than an order of magnitude faster than an Intel E5-2650 v2
>> >> for the same tests. However, it costs as much as the CPU ($1200). My
>> >> takeaway is that GPUs are making better progress in price/performance.
>> >>
>> >>
>> >>
>> >> Xiangrui, I was also surprised that BIDMat-cuda was faster than
>> >> netlib-cuda, and the most reasonable explanation is that it holds the
>> >> result in GPU memory, as Sam suggested. At the same time, that is OK
>> >> because you can copy the result back from the GPU only when needed.
>> >> However, to be sure, I am going to ask the developer of BIDMat at his
>> >> upcoming talk.
>> >>
>> >>
>> >>
>> >> Best regards, Alexander
>> >>
>> >>
>> >> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>> >> Sent: Thursday, February 26, 2015 1:56 PM
>> >> To: Xiangrui Meng
>> >> Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R.
>> >> Sparks
>> >> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>
>> >>
>> >> Btw, I wish people would stop cheating when comparing CPU and GPU
>> >> timings for things like matrix multiply :-P
>> >>
>> >> Please always compare apples with apples and include the time it takes
>> >> to set up the matrices, send them to the processing unit, do the
>> >> calculation, AND copy the results back to where you need to see them.
>> >>
>> >> Ignoring this method will make you believe that your GPU is thousands
>> >> of times faster than it really is. Again, jump to the end of my talk for
>> >> graphs and more discussion....  especially the bit about me being keen on
>> >> funding to investigate APU hardware further ;-) (I believe it will solve 
>> >> the
>> >> problem)
>> >> On 26 Feb 2015 21:16, "Xiangrui Meng"
>> >> <men...@gmail.com> wrote:
>> >> Hey Alexander,
>> >>
>> >> I don't quite understand the part where netlib-cublas is about 20x
>> >> slower than netlib-openblas. What is the overhead of using a GPU BLAS
>> >> with netlib-java?
>> >>
>> >> CC'ed Sam, the author of netlib-java.
>> >>
>> >> Best,
>> >> Xiangrui
>> >>
>> >> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
>> >> <jos...@databricks.com> wrote:
>> >>> Better documentation for linking would be very helpful!  Here's a
>> >>> JIRA:
>> >>> https://issues.apache.org/jira/browse/SPARK-6019
>> >>>
>> >>>
>> >>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>> >>> <evan.spa...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> Thanks for compiling all the data and running these benchmarks, Alex.
>> >>>> The
>> >>>> big takeaways here can be seen with this chart:
>> >>>>
>> >>>>
>> >>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>> >>>>
>> >>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>> >>>> BIDMat+GPU) can provide substantial (but less than an order of
>> >>>> magnitude)
>> >>>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>> >>>> netlib-java+openblas-compiled).
>> >>>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
>> >>>> worse
>> >>>> than a well-tuned CPU implementation, particularly for larger
>> >>>> matrices.
>> >>>> (netlib-f2jblas or netlib-ref). This is not to pick on netlib - this
>> >>>> basically agrees with the author's own benchmarks (
>> >>>> https://github.com/fommil/netlib-java)
>> >>>>
>> >>>> I think that most of our users are in a situation where using GPUs
>> >>>> may not
>> >>>> be practical - although we could consider having a good GPU backend
>> >>>> available as an option. However, *ALL* users of MLlib could benefit
>> >>>> (potentially tremendously) from using a well-tuned CPU-based BLAS
>> >>>> implementation. Perhaps we should consider updating the mllib guide
>> >>>> with a
>> >>>> more complete section for enabling high performance binaries on OSX
>> >>>> and
>> >>>> Linux? Or better, figure out a way for the system to fetch these
>> >>>> automatically.
>> >>>>
>> >>>> - Evan
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>> >>>> alexander.ula...@hp.com> wrote:
>> >>>>
>> >>>>> Just to summarize this thread, I was finally able to make all
>> >>>>> performance comparisons that we discussed. It turns out that:
>> >>>>>
>> >>>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
>> >>>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>> >>>>>
>> >>>>> Below is the link to the spreadsheet with full results.
>> >>>>>
>> >>>>>
>> >>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>> >>>>>
>> >>>>> One thing still needs exploration: does BIDMat-cublas perform
>> >>>>> copying
>> >>>>> to/from machine’s RAM?
>> >>>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Ulanov, Alexander
>> >>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>> >>>>> To: Evan R. Sparks
>> >>>>> Cc: Joseph Bradley;
>> >>>>> dev@spark.apache.org
>> >>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> Thanks, Evan! It seems that the ticket was marked as a duplicate,
>> >>>>> though the original one discusses a slightly different topic. I was
>> >>>>> able to link netlib with MKL from the BIDMat binaries. Indeed, MKL is
>> >>>>> statically linked inside a 60MB library.
>> >>>>>
>> >>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>> >>>>> +-------------------------+-------------+---------------------------------+----------------------------------------+-----------------------+
>> >>>>> | 100x100*100x100         | 0.00205596  | 0.000381                        | 0.03810324                             | 0.002556              |
>> >>>>> | 1000x1000*1000x1000     | 0.018320947 | 0.038316857                     | 0.51803557                             | 1.638475459           |
>> >>>>> | 10000x10000*10000x10000 | 23.78046632 | 32.94546697                     | 445.0935211                            | 1569.233228           |
>> >>>>>
>> >>>>> It turns out that pre-compiled MKL is faster than pre-compiled OpenBLAS
>> >>>>> on my machine. Probably I'll add two more columns with locally compiled
>> >>>>> OpenBLAS and CUDA.
>> >>>>>
>> >>>>> Alexander
>> >>>>>
>> >>>>> From: Evan R. Sparks
>> >>>>> [mailto:evan.spa...@gmail.com]
>> >>>>> Sent: Monday, February 09, 2015 6:06 PM
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc: Joseph Bradley;
>> >>>>> dev@spark.apache.org
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>> >>>>> ticket? (Here's one:
>> >>>>> https://issues.apache.org/jira/browse/SPARK-5705)
>> >>>>>
>> >>>>> It seems like this is going to be somewhat exploratory for a while
>> >>>>> (and
>> >>>>> there's probably only a handful of us who really care about fast
>> >>>>> linear
>> >>>>> algebra!)
>> >>>>>
>> >>>>> - Evan
>> >>>>>
>> >>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>> >>>>> alexander.ula...@hp.com> wrote:
>> >>>>> Hi Evan,
>> >>>>>
>> >>>>> Thank you for the explanation and the useful link. I am going to build
>> >>>>> OpenBLAS, link it with Netlib-java and run the benchmark again.
>> >>>>>
>> >>>>> Do I understand correctly that the BIDMat binaries contain a statically
>> >>>>> linked Intel MKL BLAS? That might be the reason why I am able to run
>> >>>>> BIDMat without having MKL BLAS installed on my server. If it is true, I
>> >>>>> wonder whether that is OK, since Intel sells this library. Nevertheless,
>> >>>>> it seems that in my case precompiled MKL BLAS performs better than
>> >>>>> precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed to
>> >>>>> be on par in terms of JNI overhead.
>> >>>>>
>> >>>>> Though, it might be interesting to link Netlib-java with Intel MKL, as
>> >>>>> you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
>> >>>>> (Netlib-java) would be interested in comparing their libraries.
>> >>>>>
>> >>>>> Best regards, Alexander
>> >>>>>
>> >>>>> From: Evan R. Sparks
>> >>>>> [mailto:evan.spa...@gmail.com]
>> >>>>> Sent: Friday, February 06, 2015 5:58 PM
>> >>>>>
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc: Joseph Bradley;
>> >>>>> dev@spark.apache.org
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> I would build OpenBLAS yourself, since good BLAS performance comes
>> >>>>> from
>> >>>>> getting cache sizes, etc. set up correctly for your particular
>> >>>>> hardware -
>> >>>>> this is often a very tricky process (see, e.g. ATLAS), but we found
>> >>>>> that on
>> >>>>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>> >>>>> performance competitive with MKL.
>> >>>>>
>> >>>>> To make sure the right library is getting used, you have to make sure
>> >>>>> it's first on the search path - export
>> >>>>> LD_LIBRARY_PATH=/path/to/directory/containing/the/blas/library will do
>> >>>>> the trick here.
>> >>>>>
>> >>>>> For some examples of getting netlib-java set up on an EC2 node, and
>> >>>>> some example benchmarking code we ran a while back, see:
>> >>>>> https://github.com/shivaram/matrix-bench
>> >>>>>
>> >>>>> In particular - build-openblas-ec2.sh shows you how to build the
>> >>>>> library
>> >>>>> and set up symlinks correctly, and scala/run-netlib.sh shows you how
>> >>>>> to get the path set up and get that library picked up by netlib-java.
>> >>>>>
>> >>>>> In this way - you could probably get cuBLAS set up to be used by
>> >>>>> netlib-java as well.
>> >>>>>
>> >>>>> - Evan
>> >>>>>
>> >>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>> >>>>> alexander.ula...@hp.com> wrote:
>> >>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>> >>>>> load the right BLAS? For netlib there are a few JVM flags, such as
>> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, with
>> >>>>> which I can force it to use the Java implementation. I am not sure I
>> >>>>> understand how to force the use of a specific BLAS (as opposed to a
>> >>>>> specific wrapper for BLAS).
>> >>>>>
>> >>>>> Btw. I have installed openblas (yum install openblas), so I suppose
>> >>>>> that
>> >>>>> netlib is using it.
>> >>>>>
>> >>>>> From: Evan R. Sparks
>> >>>>> [mailto:evan.spa...@gmail.com]
>> >>>>> Sent: Friday, February 06, 2015 5:19 PM
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc: Joseph Bradley;
>> >>>>> dev@spark.apache.org
>> >>>>>
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> Getting breeze to pick up the right blas library is critical for
>> >>>>> performance. I recommend using OpenBLAS (or MKL, if you already have
>> >>>>> it).
>> >>>>> It might make sense to force BIDMat to use the same underlying BLAS
>> >>>>> library
>> >>>>> as well.
>> >>>>>
>> >>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>> >>>>> alexander.ula...@hp.com> wrote:
>> >>>>> Hi Evan, Joseph
>> >>>>>
>> >>>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>> >>>>> faster than netlib-java+breeze (sorry for the weird table formatting):
>> >>>>>
>> >>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>> >>>>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
>> >>>>> | 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
>> >>>>> | 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
>> >>>>> | 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
>> >>>>>
>> >>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>> >>>>> 19
>> >>>>> Linux, Scala 2.11.
>> >>>>>
>> >>>>> Later I will run tests with CUDA. I need to install a new CUDA version
>> >>>>> for this purpose.
>> >>>>>
>> >>>>> Do you have any ideas why breeze-netlib with native blas is so much
>> >>>>> slower than BIDMat MKL?
>> >>>>>
>> >>>>> Best regards, Alexander
>> >>>>>
>> >>>>> From: Joseph Bradley
>> >>>>> [mailto:jos...@databricks.com]
>> >>>>> Sent: Thursday, February 05, 2015 5:29 PM
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc: Evan R. Sparks;
>> >>>>> dev@spark.apache.org
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> Hi Alexander,
>> >>>>>
>> >>>>> Using GPUs with Spark would be very exciting.  Small comment:
>> >>>>> Concerning
>> >>>>> your question earlier about keeping data stored on the GPU rather
>> >>>>> than
>> >>>>> having to move it between main memory and GPU memory on each
>> >>>>> iteration, I
>> >>>>> would guess this would be critical to getting good performance.  If
>> >>>>> you
>> >>>>> could do multiple local iterations before aggregating results, then
>> >>>>> the
>> >>>>> cost of data movement to the GPU could be amortized (and I believe
>> >>>>> that is
>> >>>>> done in practice).  Having Spark be aware of the GPU and using it as
>> >>>>> another part of memory sounds like a much bigger undertaking.
>> >>>>>
>> >>>>> Joseph
>> >>>>>
>> >>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>> >>>>> alexander.ula...@hp.com> wrote:
>> >>>>> Thank you for the explanation! I've watched the BIDMach presentation by
>> >>>>> John
>> >>>>> Canny and I am really inspired by his talk and comparisons with
>> >>>>> Spark MLlib.
>> >>>>>
>> >>>>> I am very interested to find out what will be better within Spark:
>> >>>>> BIDMat
>> >>>>> or netlib-java with CPU or GPU natives. Could you suggest a fair way
>> >>>>> to
>> >>>>> benchmark them? Currently I do benchmarks on artificial neural
>> >>>>> networks in
>> >>>>> batch mode. While it is not a “pure” test of linear algebra, it
>> >>>>> involves
>> >>>>> some other things that are essential to machine learning.
>> >>>>>
>> >>>>> From: Evan R. Sparks
>> >>>>> [mailto:evan.spa...@gmail.com]
>> >>>>> Sent: Thursday, February 05, 2015 1:29 PM
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc:
>> >>>>> dev@spark.apache.org
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>> >>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
>> >>>>> data
>> >>>>> layout and fewer levels of indirection - it's definitely a
>> >>>>> worthwhile
>> >>>>> experiment to run. The main speedups I've seen from using it come
>> >>>>> from
>> >>>>> highly optimized GPU code for linear algebra. I know that in the
>> >>>>> past Canny
>> >>>>> has gone as far as to write custom GPU kernels for
>> >>>>> performance-critical
>> >>>>> regions of code.[1]
>> >>>>>
>> >>>>> BIDMach is highly optimized for single node performance or
>> >>>>> performance on
>> >>>>> small clusters.[2] Once data doesn't fit easily in GPU memory (or can't
>> >>>>> be batched in that way) the performance tends to fall off. Canny argues
>> >>>>> for
>> >>>>> hardware/software codesign and as such prefers machine
>> >>>>> configurations that
>> >>>>> are quite different than what we find in most commodity cluster
>> >>>>> nodes -
>> >>>>> e.g. 10 disk channels and 4 GPUs.
>> >>>>>
>> >>>>> In contrast, MLlib was designed for horizontal scalability on
>> >>>>> commodity
>> >>>>> clusters and works best on very big datasets - order of terabytes.
>> >>>>>
>> >>>>> For the most part, these projects developed concurrently to address
>> >>>>> slightly different use cases. That said, there may be bits of
>> >>>>> BIDMach we
>> >>>>> could repurpose for MLlib - keep in mind we need to be careful about
>> >>>>> maintaining cross-language compatibility for our Java and
>> >>>>> Python-users,
>> >>>>> though.
>> >>>>>
>> >>>>> - Evan
>> >>>>>
>> >>>>> [1] - http://arxiv.org/abs/1409.5402
>> >>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>> >>>>>
>> >>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>> >>>>> alexander.ula...@hp.com> wrote:
>> >>>>> Hi Evan,
>> >>>>>
>> >>>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
>> >>>>> you know what makes it faster than netlib-java?
>> >>>>>
>> >>>>> The same group has the BIDMach library that implements machine
>> >>>>> learning. For some examples they use the Caffe convolutional neural
>> >>>>> network library developed by another group at Berkeley. Could you
>> >>>>> elaborate on how all of these might be connected with Spark MLlib? If
>> >>>>> you take BIDMat for linear algebra, why don't you take BIDMach for
>> >>>>> optimization and learning?
>> >>>>>
>> >>>>> Best regards, Alexander
>> >>>>>
>> >>>>> From: Evan R. Sparks
>> >>>>> [mailto:evan.spa...@gmail.com]
>> >>>>> Sent: Thursday, February 05, 2015 12:09 PM
>> >>>>> To: Ulanov, Alexander
>> >>>>> Cc:
>> >>>>> dev@spark.apache.org
>> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>> >>>>>
>> >>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>> >>>>> blas in
>> >>>>> many cases.
>> >>>>>
>> >>>>> You might consider taking a look at the codepaths that BIDMat (
>> >>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>> >>>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>> >>>>> optimizing
>> >>>>> to make this work really fast from Scala. I've run it on my laptop
>> >>>>> and
>> >>>>> compared to MKL and in certain cases it's 10x faster at matrix
>> >>>>> multiply.
>> >>>>> There are a lot of layers of indirection here and you really want to
>> >>>>> avoid
>> >>>>> data copying as much as possible.
>> >>>>>
>> >>>>> We could also consider swapping Breeze out for BIDMat, but that would
>> >>>>> be a big project, and if we can figure out how to get breeze+cublas to
>> >>>>> comparable performance that would be a big win.
>> >>>>>
>> >>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>> >>>>> alexander.ula...@hp.com> wrote:
>> >>>>> Dear Spark developers,
>> >>>>>
>> >>>>> I am exploring how to make linear algebra operations faster within
>> >>>>> Spark.
>> >>>>> One way of doing this is to use Scala Breeze library that is bundled
>> >>>>> with
>> >>>>> Spark. For matrix operations, it employs Netlib-java that has a Java
>> >>>>> wrapper for BLAS (basic linear algebra subprograms) and LAPACK
>> >>>>> native
>> >>>>> binaries if they are available on the worker node. It also has its
>> >>>>> own
>> >>>>> optimized Java implementation of BLAS. It is worth mentioning that
>> >>>>> native
>> >>>>> binaries provide better performance only for BLAS level 3, i.e.
>> >>>>> matrix-matrix operations or general matrix multiplication (GEMM).
>> >>>>> This is
>> >>>>> confirmed by GEMM test on Netlib-java page
>> >>>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>> >>>>> experiments with training of artificial neural network
>> >>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>> >>>>> However, I would like to boost performance more.
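>> >>>>>
>> >>>>> For reference, the kind of call I am benchmarking boils down to roughly
>> >>>>> the following in Breeze (sizes here are just an example):
>> >>>>>
>> >>>>> import breeze.linalg._
>> >>>>> val n = 1000
>> >>>>> val a = DenseMatrix.rand(n, n)
>> >>>>> val b = DenseMatrix.rand(n, n)
>> >>>>> val c = a * b  // goes through netlib-java's dgemm when a native BLAS is found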
>> >>>>>
>> >>>>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia
>> >>>>> CUDA implementation of BLAS, called cuBLAS. I have one Linux server with
>> >>>>> Nvidia
>> >>>>> GPU and I was able to do the following. I linked cublas (instead of
>> >>>>> cpu-based blas) with Netlib-java wrapper and put it into Spark, so
>> >>>>> Breeze/Netlib is using it. Then I did some performance measurements
>> >>>>> with
>> >>>>> regards to artificial neural network batch learning in Spark MLlib
>> >>>>> that
>> >>>>> involves matrix-matrix multiplications. It turns out that for
>> >>>>> matrices of
>> >>>>> size less than ~1000x780 GPU cublas has the same speed as CPU blas.
>> >>>>> Cublas
>> >>>>> becomes slower for bigger matrices. It is worth mentioning that it was
>> >>>>> not a test of ONLY multiplication, since there are other operations
>> >>>>> involved. One of the reasons for the slowdown might be the overhead of
>> >>>>> copying the matrices from main memory to graphics card memory and back.
>> >>>>>
>> >>>>> So, a few questions:
>> >>>>> 1) Do these results with CUDA make sense?
>> >>>>> 2) If the problem is the copy overhead, are there any libraries that
>> >>>>> allow forcing intermediate results to stay in graphics card memory, thus
>> >>>>> removing the overhead?
>> >>>>> 3) Any other options to speed up linear algebra in Spark?
>> >>>>>
>> >>>>> Thank you, Alexander
>> >>>>>
>> >>>>>
>> >>>>
>> >
>> > --
>> > Best regards,
>> > Sam
>> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
