Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Xiangrui Meng
Hey Sam,

The running times are not big O estimates:

 The CPU version finished in 12 seconds.
 The CPU-GPU-CPU version finished in 2.2 seconds.
 The GPU version finished in 1.7 seconds.

I think there is something wrong with the netlib/cublas combination.
Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
through the CPU BLAS interface we need to use NVBLAS, which intercepts
some Level 3 CPU BLAS calls (including GEMM). So we need to load
nvblas.so first and then some CPU BLAS library in JNI. I wonder
whether the setup was correct.
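
If it helps, here is a quick way to check from the JVM side which backend
netlib-java actually resolved (a small sketch, assuming netlib-java is on the
classpath; NativeSystemBLAS and F2jBLAS are the class names I'd expect from
the netlib-java docs, not something I've verified on Alexander's setup):

    import com.github.fommil.netlib.BLAS

    // Print the concrete BLAS implementation picked at runtime.
    // NativeSystemBLAS means the system libblas.so.3 was loaded (which NVBLAS
    // could be intercepting); F2jBLAS means it silently fell back to pure
    // Java, in which case the GPU cannot have been used at all.
    println(BLAS.getInstance().getClass.getName)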

Alexander, could you check whether GPU is used in the netlib-cublas
experiments? You can tell it by watching CPU/GPU usage.

Best,
Xiangrui

On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday sam.halli...@gmail.com wrote:
 Don't use big O estimates, always measure. It used to work back in the
 days when double multiplication was a bottleneck. The computation cost is
 effectively free on both the CPU and GPU and you're seeing pure copying
 costs. Also, I'm dubious that cublas is doing what you think it is. Can you
 link me to the source code for DGEMM?

 I show all of this in my talk, with explanations; I can't stress enough how
 much I recommend that you watch it if you want to understand
 high-performance hardware acceleration for linear algebra :-)

 On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:

 The copying overhead should be quadratic on n, while the computation
 cost is cubic on n. I can understand that netlib-cublas is slower than
 netlib-openblas on small problems. But I'm surprised to see that it is
 still 20x slower on 10000x10000. I did the following on a g2.2xlarge
 instance with BIDMat:

 val n = 10000

 val f = rand(n, n)
 flip; f*f; val rf = flop  // CPU: multiply on the host

 flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop  // CPU-GPU-CPU: copy in, multiply, copy back

 flip; g*g; val rgg = flop  // GPU: operands already resident on the GPU

 The CPU version finished in 12 seconds.
 The CPU-GPU-CPU version finished in 2.2 seconds.
 The GPU version finished in 1.7 seconds.

 I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas
 path. But based on the result, the data copying overhead is definitely
 not as big as 20x at n = 10000.
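
 To put rough numbers on the copy-vs-compute argument, here is a
 back-of-envelope sketch (my assumptions, not measurements: single-precision
 10000x10000 matrices, ~6 GB/s effective PCIe transfer, ~1 TFLOPS sustained
 GEMM on the GPU):

   val n = 10000L
   val bytes = 3L * n * n * 4L      // ship A and B to the GPU, bring C back (4-byte floats)
   val flops = 2L * n * n * n       // multiply-add count for one GEMM
   val copySec = bytes / 6e9        // assumed ~6 GB/s effective PCIe bandwidth
   val gemmSec = flops / 1e12       // assumed ~1 TFLOPS sustained on the GPU
   println(f"copy ~$copySec%.2f s, compute ~$gemmSec%.2f s")

 Under those assumptions the copies cost a few tenths of a second against
 roughly two seconds of compute, which is consistent with the 2.2s vs 1.7s
 gap above and nowhere near a 20x penalty.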

 Best,
 Xiangrui


 On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday sam.halli...@gmail.com
 wrote:
  I've had some email exchanges with the author of BIDMat: it does exactly
  what you need to get the GPU benefit and writes higher level algorithms
  entirely in the GPU kernels so that the memory stays there as long as
  possible. The restriction with this approach is that it only offers
  high-level algorithms, so it is not a toolkit for applied mathematics
  research and development --- but it works well as a toolkit for
  higher-level analysis (e.g. for analysts and practitioners).
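
  To make that concrete, the flow looks roughly like this (a BIDMat-flavoured
  sketch; I haven't run it, and the exact method names may differ from the
  current BIDMat API):

    val n = 10000
    val a = rand(n, n)                    // single-precision matrix on the CPU
    val ga = GMat(n, n); ga.copyFrom(a)   // one host-to-device copy
    val g2 = ga * ga                      // runs on the GPU, result stays there
    val g4 = g2 * g2                      // still on the GPU, no transfer
    val result = g4.toFMat(null)          // one device-to-host copy at the end

  You pay the transfer once on the way in and once on the way out; a per-call
  BLAS offload pays it on every single operation.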
 
  I believe BIDMat's approach is the best way to get performance out of
  GPU hardware at the moment, but I also have strong evidence to suggest
  that the hardware will catch up and the memory transfer costs between
  CPU/GPU will disappear, meaning that there will be no need for custom
  GPU kernel implementations. In other words, please continue to use BLAS
  primitives when writing new algorithms and only go to the GPU for an
  alternative optimised implementation.
 
  Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
  an API that looks like BLAS but takes pointers to special regions of GPU
  memory. Somebody has written a wrapper around CUDA to create a proper
  BLAS library, but it only gives a marginal improvement over the CPU
  because of the memory transfer overhead.
 
  This slide from my talk
 
http://fommil.github.io/scalax14/#/11/2
 
  says it all. The X axis is matrix size, the Y axis is (logarithmic) time
  to do DGEMM. The black line is the "cheating" time for the GPU, and the
  green line is after copying the data to/from GPU memory. APUs have the
  potential to eliminate the green line.
 
  Best regards,
  Sam
 
 
 
  Ulanov, Alexander alexander.ula...@hp.com writes:
 
  Evan, thank you for the summary. I would like to add some more
  observations. The GPU that I used is 2.5 times cheaper than the CPU
  ($250 vs $100). They both are 3 years old. I also did a small test with
  modern hardware, and the new GPU nVidia Titan was slightly more than 1
  order of magnitude faster than the Intel E5-2650 v2 on the same tests.
  However, it costs as much as the CPU ($1200). My takeaway is that GPUs
  are making better price/value progress.
 
 
 
  Xiangrui, I was also surprised that BIDMat-cuda was faster than
  netlib-cuda, and the most reasonable explanation is that it holds the
  result in GPU memory, as Sam suggested. At the same time, that is OK
  because you can copy the result back from the GPU only when needed.
  However, to be sure, I am going to ask the developer of BIDMat at his
  upcoming talk.
 
 
 
  Best regards, Alexander
 
 
  From: Sam Halliday [mailto:sam.halli...@gmail.com]
  Sent: Thursday, February 

Re: trouble with sbt building network-* projects?

2015-02-27 Thread Imran Rashid
well, perhaps I just need to learn to use maven better, but currently I
find sbt much more convenient for continuously running my tests.  I do use
zinc, but I'm looking for continuous testing.  This makes me think I need
sbt for that:
http://stackoverflow.com/questions/11347633/is-there-a-java-continuous-testing-plugin-for-maven

1) I really like that in sbt I can run ~test-only
com.foo.bar.SomeTestSuite (or whatever other pattern) and just leave that
running as I code, without having to go and explicitly trigger mvn test
and wait for the result.

2) I find sbt's handling of sub-projects much simpler (when it works).  I'm
trying to make changes to network/common & network/shuffle, which means I
have to keep cd'ing into network/common, run mvn install, then go back to
network/shuffle and run some other mvn command over there.  I don't want to
run mvn at the root project level, b/c I don't want to wait for it to
compile all the other projects when I just want to run tests in
network/common.  Even with incremental compiling, in my day-to-day coding I
want to entirely skip compiling sql, graphx, mllib etc. -- I have to switch
branches often enough that I end up triggering a full rebuild of those
projects even when I haven't touched them.





On Fri, Feb 27, 2015 at 1:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. to be able to run my tests in sbt, though, it makes the development
 iterations much faster.

 Was the preference for sbt due to long maven build time ?
 Have you started Zinc on your machine ?

 Cheers

 On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid iras...@cloudera.com
 wrote:

 Has anyone else noticed very strange build behavior in the network-*
 projects?

 maven seems to be doing the right thing, but sbt is very inconsistent.
 Sometimes when it builds network-shuffle it doesn't know about any of the
 code in network-common.  Sometimes it will completely skip the java unit
 tests.  And then some time later, it'll suddenly decide it knows about some
 more of the java unit tests.  It's not from a simple change, like touching a
 test file, or a file the test depends on -- nor a restart of sbt.  I am
 pretty confused.


 maven had issues when I tried to add scala code to network-common, it would
 compile the scala code but not make it available to java.  I'm working
 around that by just coding in java anyhow.  I'd really like to be able to
 run my tests in sbt, though, it makes the development iterations much
 faster.

 thanks,
 Imran





Re: trouble with sbt building network-* projects?

2015-02-27 Thread Ted Yu
bq. to be able to run my tests in sbt, though, it makes the development
iterations much faster.

Was the preference for sbt due to long maven build time ?
Have you started Zinc on your machine ?

Cheers

On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid iras...@cloudera.com wrote:

 Has anyone else noticed very strange build behavior in the network-*
 projects?

 maven seems to be doing the right thing, but sbt is very inconsistent.
 Sometimes when it builds network-shuffle it doesn't know about any of the
 code in network-common.  Sometimes it will completely skip the java unit
 tests.  And then some time later, it'll suddenly decide it knows about some
 more of the java unit tests.  It's not from a simple change, like touching a
 test file, or a file the test depends on -- nor a restart of sbt.  I am
 pretty confused.


 maven had issues when I tried to add scala code to network-common, it would
 compile the scala code but not make it available to java.  I'm working
 around that by just coding in java anyhow.  I'd really like to be able to
 run my tests in sbt, though, it makes the development iterations much
 faster.

 thanks,
 Imran



trouble with sbt building network-* projects?

2015-02-27 Thread Imran Rashid
Has anyone else noticed very strange build behavior in the network-*
projects?

maven seems to be doing the right thing, but sbt is very inconsistent.
Sometimes when it builds network-shuffle it doesn't know about any of the
code in network-common.  Sometimes it will completely skip the java unit
tests.  And then some time later, it'll suddenly decide it knows about some
more of the java unit tests.  It's not from a simple change, like touching a
test file, or a file the test depends on -- nor a restart of sbt.  I am
pretty confused.


maven had issues when I tried to add scala code to network-common, it would
compile the scala code but not make it available to java.  I'm working
around that by just coding in java anyhow.  I'd really like to be able to
run my tests in sbt, though, it makes the development iterations much
faster.

thanks,
Imran


Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Sam Halliday
Also, check the JNILoader output.

Remember, for netlib-java to use your system libblas all you need to do is
set up libblas.so.3 like any native application would expect.

I haven't ever used the cublas real BLAS implementation, so I'd be
interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to check
that all the runtime links are in order.

Btw, I have some DGEMM wrappers in my netlib-java performance module... and
I also planned to write more in MultiBLAS (until I mothballed the project to
wait for the hardware to catch up, which it probably has, and now I just need
a reason to look at it).
 On 27 Feb 2015 20:26, Xiangrui Meng men...@gmail.com wrote:

 Hey Sam,

 The running times are not big O estimates:

  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.

 I think there is something wrong with the netlib/cublas combination.
 Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
 interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
 through the CPU BLAS interface we need to use NVBLAS, which intercepts
 some Level 3 CPU BLAS calls (including GEMM). So we need to load
 nvblas.so first and then some CPU BLAS library in JNI. I wonder
 whether the setup was correct.

 Alexander, could you check whether GPU is used in the netlib-cublas
 experiments? You can tell it by watching CPU/GPU usage.

 Best,
 Xiangrui

 On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday sam.halli...@gmail.com
 wrote:
  Don't use big O estimates, always measure. It used to work back in the
  days when double multiplication was a bottleneck. The computation cost is
  effectively free on both the CPU and GPU and you're seeing pure copying
  costs. Also, I'm dubious that cublas is doing what you think it is. Can
 you
  link me to the source code for DGEMM?
 
  I show all of this in my talk, with explanations, I can't stress enough
 how
  much I recommend that you watch it if you want to understand high
  performance hardware acceleration for linear algebra :-)
 
  On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:
 
  The copying overhead should be quadratic on n, while the computation
  cost is cubic on n. I can understand that netlib-cublas is slower than
  netlib-openblas on small problems. But I'm surprised to see that it is
  still 20x slower on 10000x10000. I did the following on a g2.2xlarge
  instance with BIDMat:
 
  val n = 10000
 
  val f = rand(n, n)
  flip; f*f; val rf = flop
 
  flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg =
 flop
 
  flip; g*g; val rgg = flop
 
  The CPU version finished in 12 seconds.
  The CPU-GPU-CPU version finished in 2.2 seconds.
  The GPU version finished in 1.7 seconds.
 
  I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas
  path. But based on the result, the data copying overhead is definitely
  not as big as 20x at n = 10000.
 
  Best,
  Xiangrui
 
 
  On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday sam.halli...@gmail.com
  wrote:
   I've had some email exchanges with the author of BIDMat: it does
 exactly
   what you need to get the GPU benefit and writes higher level
 algorithms
   entirely in the GPU kernels so that the memory stays there as long as
   possible. The restriction with this approach is that it is only
 offering
   high-level algorithms so is not a toolkit for applied mathematics
   research and development --- but it works well as a toolkit for higher
   level analysis (e.g. for analysts and practitioners).
  
   I believe BIDMat's approach is the best way to get performance out of
   GPU hardware at the moment but I also have strong evidence to suggest
   that the hardware will catch up and the memory transfer costs between
   CPU/GPU will disappear meaning that there will be no need for custom
 GPU
   kernel implementations. i.e. please continue to use BLAS primitives
 when
   writing new algorithms and only go to the GPU for an alternative
   optimised implementation.
  
   Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and
 offer
   an API that looks like BLAS but takes pointers to special regions in
 the
   GPU memory region. Somebody has written a wrapper around CUDA to
 create
   a proper BLAS library but it only gives marginal performance over the
   CPU because of the memory transfer overhead.
  
   This slide from my talk
  
 http://fommil.github.io/scalax14/#/11/2
  
   says it all. X axis is matrix size, Y axis is logarithmic time to do
   DGEMM. Black line is the cheating time for the GPU and the green
 line
   is after copying the memory to/from the GPU memory. APUs have the
   potential to eliminate the green line.
  
   Best regards,
   Sam
  
  
  
   Ulanov, Alexander alexander.ula...@hp.com writes:
  
   Evan, thank you for the summary. I would like to add some more
   observations. The GPU that I used is 2.5 times cheaper than the CPU
 ($250 vs
   $100). They both are 3 years old. 

Re: trouble with sbt building network-* projects?

2015-02-27 Thread Ted Yu
bq. I have to keep cd'ing into network/common, run mvn install, then go
back to network/shuffle and run some other mvn command over there.

Yeah - been through this.

Having continuous testing for maven would be nice.

On Fri, Feb 27, 2015 at 11:31 AM, Imran Rashid iras...@cloudera.com wrote:

 well, perhaps I just need to learn to use maven better, but currently I
 find sbt much more convenient for continuously running my tests.  I do use
 zinc, but I'm looking for continuous testing.  This makes me think I need
 sbt for that:
 http://stackoverflow.com/questions/11347633/is-there-a-java-continuous-testing-plugin-for-maven

 1) I really like that in sbt I can run ~test-only
 com.foo.bar.SomeTestSuite (or whatever other pattern) and just leave that
 running as I code, without having to go and explicitly trigger mvn test
 and wait for the result.

 2) I find sbt's handling of sub-projects much simpler (when it works).
 I'm trying to make changes to network/common & network/shuffle, which means
 I have to keep cd'ing into network/common, run mvn install, then go back to
 network/shuffle and run some other mvn command over there.  I don't want to
 run mvn at the root project level, b/c I don't want to wait for it to
 compile all the other projects when I just want to run tests in
 network/common.  Even with incremental compiling, in my day-to-day coding I
 want to entirely skip compiling sql, graphx, mllib etc. -- I have to switch
 branches often enough that I end up triggering a full rebuild of those
 projects even when I haven't touched them.





 On Fri, Feb 27, 2015 at 1:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. to be able to run my tests in sbt, though, it makes the development
 iterations much faster.

 Was the preference for sbt due to long maven build time ?
 Have you started Zinc on your machine ?

 Cheers

 On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid iras...@cloudera.com
 wrote:

 Has anyone else noticed very strange build behavior in the network-*
 projects?

 maven seems to be doing the right thing, but sbt is very inconsistent.
 Sometimes when it builds network-shuffle it doesn't know about any of the
 code in network-common.  Sometimes it will completely skip the java unit
 tests.  And then some time later, it'll suddenly decide it knows about some
 more of the java unit tests.  It's not from a simple change, like touching a
 test file, or a file the test depends on -- nor a restart of sbt.  I am
 pretty confused.


 maven had issues when I tried to add scala code to network-common, it would
 compile the scala code but not make it available to java.  I'm working
 around that by just coding in java anyhow.  I'd really like to be able to
 run my tests in sbt, though, it makes the development iterations much
 faster.

 thanks,
 Imran