Re: Using CUDA within Spark / boosting linear algebra
Hey Sam,

The running times are not big-O estimates:

The CPU version finished in 12 seconds.
The CPU-GPU-CPU version finished in 2.2 seconds.
The GPU version finished in 1.7 seconds.

I think there is something wrong with the netlib/cublas combination. Sam already mentioned that cuBLAS doesn't implement the CPU BLAS interfaces. I checked the CUDA docs, and it seems that to use GPU BLAS through the CPU BLAS interface we need NVBLAS, which intercepts some Level 3 CPU BLAS calls (including GEMM). So we need to load nvblas.so first and then some CPU BLAS library in JNI. I wonder whether the setup was correct (a minimal example of such a setup is sketched below). Alexander, could you check whether the GPU is actually used in the netlib-cublas experiments? You can tell by watching CPU/GPU usage.

Best,
Xiangrui

On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday sam.halli...@gmail.com wrote:

Don't use big-O estimates, always measure. It used to work back in the days when double multiplication was a bottleneck. The computation cost is effectively free on both the CPU and GPU, and you're seeing pure copying costs. Also, I'm dubious that cublas is doing what you think it is. Can you link me to the source code for DGEMM? I show all of this in my talk, with explanations. I can't stress enough how much I recommend that you watch it if you want to understand high-performance hardware acceleration for linear algebra :-)

On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:

The copying overhead should be quadratic in n, while the computation cost is cubic in n. I can understand that netlib-cublas is slower than netlib-openblas on small problems, but I'm surprised to see that it is still 20x slower on 10000x10000. I did the following on a g2.2xlarge instance with BIDMat:

    val n = 10000
    val f = rand(n, n)
    flip; f*f; val rf = flop
    flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
    flip; g*g; val rgg = flop

The CPU version finished in 12 seconds. The CPU-GPU-CPU version finished in 2.2 seconds. The GPU version finished in 1.7 seconds. I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas path, but based on this result the data-copying overhead is definitely not as big as 20x at n = 10000.

Best,
Xiangrui

On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday sam.halli...@gmail.com wrote:

I've had some email exchanges with the author of BIDMat: it does exactly what you need to get the GPU benefit, and writes higher-level algorithms entirely in the GPU kernels so that the memory stays there as long as possible. The restriction of this approach is that it only offers high-level algorithms, so it is not a toolkit for applied mathematics research and development --- but it works well as a toolkit for higher-level analysis (e.g. for analysts and practitioners).

I believe BIDMat's approach is the best way to get performance out of GPU hardware at the moment, but I also have strong evidence that the hardware will catch up and the memory-transfer costs between CPU and GPU will disappear, meaning there will be no need for custom GPU kernel implementations. I.e. please continue to use BLAS primitives when writing new algorithms, and only go to the GPU for an alternative optimised implementation.

Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer an API that looks like BLAS but takes pointers to special regions in GPU memory. Somebody has written a wrapper around CUDA to create a proper BLAS library, but it only gives marginal performance over the CPU because of the memory-transfer overhead.
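[A minimal sketch of the NVBLAS setup Xiangrui describes, assuming an OpenBLAS fallback. The file paths are illustrative assumptions; the NVBLAS_CPU_BLAS_LIB directive and the NVBLAS_CONFIG_FILE variable come from the CUDA NVBLAS documentation.]

    # nvblas.conf: tell NVBLAS which CPU BLAS to use for the calls
    # it does not intercept (library path is an assumption for this example)
    NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
    NVBLAS_LOGFILE /tmp/nvblas.log

    # load NVBLAS ahead of the CPU BLAS so it can intercept Level 3 calls
    export NVBLAS_CONFIG_FILE=/path/to/nvblas.conf
    export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so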
This slide from my talk http://fommil.github.io/scalax14/#/11/2 says it all. The X axis is matrix size, the Y axis is (logarithmic) time to do a DGEMM. The black line is the cheating time for the GPU, and the green line is after copying the memory to/from GPU memory. APUs have the potential to eliminate the green line.

Best regards,
Sam

Ulanov, Alexander alexander.ula...@hp.com writes:

Evan, thank you for the summary. I would like to add some more observations. The GPU that I used is 2.5 times cheaper than the CPU ($100 vs $250). They are both 3 years old. I also did a small test with modern hardware, and the new Nvidia Titan GPU was slightly more than an order of magnitude faster than an Intel E5-2650 v2 on the same tests. However, it costs as much as the CPU ($1200). My takeaway is that GPUs are making better price/performance progress.

Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda, and the most reasonable explanation is that it holds the result in GPU memory, as Sam suggested. At the same time, that is fine, because you can copy the result back from the GPU only when needed. To be sure, though, I am going to ask the developer of BIDMat at his upcoming talk.

Best regards,
Alexander
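[To put rough numbers on the quadratic-copy versus cubic-compute argument in this thread, here is a back-of-the-envelope sketch in Scala. The PCIe bandwidth and GPU flop rate are illustrative assumptions, not measurements from the hardware discussed above.]

    // Copy-vs-compute estimate for double-precision GEMM on n x n matrices.
    val n = 10000L
    val bytesMoved = 3 * n * n * 8     // A and B to the GPU, C back; 8 bytes per double
    val flops      = 2 * n * n * n     // multiply-adds in GEMM
    val pcieGBps   = 8.0               // assumed PCIe transfer rate, GB/s
    val gpuGflops  = 1000.0            // assumed sustained GPU DGEMM rate, GFLOP/s
    val copySec    = bytesMoved / (pcieGBps * 1e9)
    val mathSec    = flops / (gpuGflops * 1e9)
    println(f"copy: $copySec%.2f s, compute: $mathSec%.2f s")

Under these assumptions the transfers take about 0.3 s against roughly 2 s of compute at n = 10000, consistent with Xiangrui's observation above that copying alone cannot explain a 20x slowdown at that size.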
Re: trouble with sbt building network-* projects?
well, perhaps I just need to learn to use maven better, but currently I find sbt much more convenient for continuously running my tests. I do use zinc, but I'm looking for continuous testing. This makes me think I need sbt for that: http://stackoverflow.com/questions/11347633/is-there-a-java-continuous-testing-plugin-for-maven

1) I really like that in sbt I can run ~test-only com.foo.bar.SomeTestSuite (or whatever other pattern) and just leave that running as I code, without having to go and explicitly trigger mvn test and wait for the result.

2) I find sbt's handling of sub-projects much simpler (when it works). I'm trying to make changes to network/common and network/shuffle, which means I have to keep cd'ing into network/common, run mvn install, then go back to network/shuffle and run some other mvn command over there. I don't want to run mvn at the root project level, b/c I don't want to wait for it to compile all the other projects when I just want to run tests in network/common. Even with incremental compilation, in my day-to-day coding I want to entirely skip compiling sql, graphx, mllib, etc. -- I have to switch branches often enough that I end up triggering a full rebuild of those projects even when I haven't touched them.

On Fri, Feb 27, 2015 at 1:14 PM, Ted Yu yuzhih...@gmail.com wrote:

bq. to be able to run my tests in sbt, though, it makes the development iterations much faster.

Was the preference for sbt due to long maven build times? Have you started Zinc on your machine?

Cheers

On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid iras...@cloudera.com wrote:

Has anyone else noticed very strange build behavior in the network-* projects? maven seems to be doing the right thing, but sbt is very inconsistent. Sometimes when it builds network-shuffle it doesn't know about any of the code in network-common. Sometimes it will completely skip the java unit tests, and then some time later it'll suddenly decide it knows about some more of the java unit tests. It's not from a simple change, like touching a test file, or a file the test depends on -- nor a restart of sbt. I am pretty confused.

maven had issues when I tried to add scala code to network-common: it would compile the scala code but not make it available to java. I'm working around that by just coding in java anyhow. I'd really like to be able to run my tests in sbt, though; it makes the development iterations much faster.

thanks,
Imran
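[As a concrete illustration of the continuous-testing workflow Imran describes above. The project id and suite name are placeholders, and this assumes the sbt build exposes the network modules as separate projects.]

    # from the Spark root, start an sbt shell, then:
    > project network-shuffle
    > ~test-only com.foo.bar.SomeTestSuite
    # sbt re-runs just that suite every time a watched source file changes;
    # press Enter to stop watching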
Re: trouble with sbt building network-* projects?
bq. to be able to run my tests in sbt, though, it makes the development iterations much faster.

Was the preference for sbt due to long maven build times? Have you started Zinc on your machine?

Cheers

On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid iras...@cloudera.com wrote:

Has anyone else noticed very strange build behavior in the network-* projects? maven seems to be doing the right thing, but sbt is very inconsistent. Sometimes when it builds network-shuffle it doesn't know about any of the code in network-common. Sometimes it will completely skip the java unit tests, and then some time later it'll suddenly decide it knows about some more of the java unit tests. It's not from a simple change, like touching a test file, or a file the test depends on -- nor a restart of sbt. I am pretty confused.

maven had issues when I tried to add scala code to network-common: it would compile the scala code but not make it available to java. I'm working around that by just coding in java anyhow. I'd really like to be able to run my tests in sbt, though; it makes the development iterations much faster.

thanks,
Imran
trouble with sbt building network-* projects?
Has anyone else noticed very strange build behavior in the network-* projects? maven seems to be doing the right thing, but sbt is very inconsistent. Sometimes when it builds network-shuffle it doesn't know about any of the code in network-common. Sometimes it will completely skip the java unit tests, and then some time later it'll suddenly decide it knows about some more of the java unit tests. It's not from a simple change, like touching a test file, or a file the test depends on -- nor a restart of sbt. I am pretty confused.

maven had issues when I tried to add scala code to network-common: it would compile the scala code but not make it available to java. I'm working around that by just coding in java anyhow. I'd really like to be able to run my tests in sbt, though; it makes the development iterations much faster.

thanks,
Imran
Re: Using CUDA within Spark / boosting linear algebra
Also, check the JNILoader output. Remember, for netlib-java to use your system libblas, all you need to do is set up libblas.so.3 like any native application would expect. I haven't ever used the cublas "real BLAS" implementation, so I'd be interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to check that all the runtime links are in order. (A quick programmatic check is sketched after the quoted thread below.)

Btw, I have some DGEMM wrappers in my netlib-java performance module... and I also planned to write more in MultiBLAS (until I mothballed the project for the hardware to catch up, which it probably has, and now I just need a reason to look at it).

On 27 Feb 2015 20:26, Xiangrui Meng men...@gmail.com wrote:

Hey Sam,

The running times are not big-O estimates:

The CPU version finished in 12 seconds.
The CPU-GPU-CPU version finished in 2.2 seconds.
The GPU version finished in 1.7 seconds.

I think there is something wrong with the netlib/cublas combination. Sam already mentioned that cuBLAS doesn't implement the CPU BLAS interfaces. I checked the CUDA docs, and it seems that to use GPU BLAS through the CPU BLAS interface we need NVBLAS, which intercepts some Level 3 CPU BLAS calls (including GEMM). So we need to load nvblas.so first and then some CPU BLAS library in JNI. I wonder whether the setup was correct. Alexander, could you check whether the GPU is actually used in the netlib-cublas experiments? You can tell by watching CPU/GPU usage.

Best,
Xiangrui

On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday sam.halli...@gmail.com wrote:

Don't use big-O estimates, always measure. It used to work back in the days when double multiplication was a bottleneck. The computation cost is effectively free on both the CPU and GPU, and you're seeing pure copying costs. Also, I'm dubious that cublas is doing what you think it is. Can you link me to the source code for DGEMM? I show all of this in my talk, with explanations. I can't stress enough how much I recommend that you watch it if you want to understand high-performance hardware acceleration for linear algebra :-)

On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote:

The copying overhead should be quadratic in n, while the computation cost is cubic in n. I can understand that netlib-cublas is slower than netlib-openblas on small problems, but I'm surprised to see that it is still 20x slower on 10000x10000. I did the following on a g2.2xlarge instance with BIDMat:

    val n = 10000
    val f = rand(n, n)
    flip; f*f; val rf = flop
    flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
    flip; g*g; val rgg = flop

The CPU version finished in 12 seconds. The CPU-GPU-CPU version finished in 2.2 seconds. The GPU version finished in 1.7 seconds. I'm not sure whether my CPU-GPU-CPU code simulates the netlib-cublas path, but based on this result the data-copying overhead is definitely not as big as 20x at n = 10000.

Best,
Xiangrui

On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday sam.halli...@gmail.com wrote:

I've had some email exchanges with the author of BIDMat: it does exactly what you need to get the GPU benefit, and writes higher-level algorithms entirely in the GPU kernels so that the memory stays there as long as possible. The restriction of this approach is that it only offers high-level algorithms, so it is not a toolkit for applied mathematics research and development --- but it works well as a toolkit for higher-level analysis (e.g. for analysts and practitioners).
I believe BIDMat's approach is the best way to get performance out of GPU hardware at the moment, but I also have strong evidence that the hardware will catch up and the memory-transfer costs between CPU and GPU will disappear, meaning there will be no need for custom GPU kernel implementations. I.e. please continue to use BLAS primitives when writing new algorithms, and only go to the GPU for an alternative optimised implementation.

Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer an API that looks like BLAS but takes pointers to special regions in GPU memory. Somebody has written a wrapper around CUDA to create a proper BLAS library, but it only gives marginal performance over the CPU because of the memory-transfer overhead.

This slide from my talk http://fommil.github.io/scalax14/#/11/2 says it all. The X axis is matrix size, the Y axis is (logarithmic) time to do a DGEMM. The black line is the cheating time for the GPU, and the green line is after copying the memory to/from GPU memory. APUs have the potential to eliminate the green line.

Best regards,
Sam

Ulanov, Alexander alexander.ula...@hp.com writes:

Evan, thank you for the summary. I would like to add some more observations. The GPU that I used is 2.5 times cheaper than the CPU ($100 vs $250). They are both 3 years old.
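[The quick programmatic check mentioned above: one way to see which BLAS implementation netlib-java actually loaded, using its public API. The class name printed depends on the machine's setup.]

    import com.github.fommil.netlib.BLAS

    // Prints e.g. NativeSystemBLAS when the system libblas.so.3 was picked up,
    // or F2jBLAS when netlib-java fell back to its pure-Java implementation.
    object CheckBlas extends App {
      println(BLAS.getInstance().getClass.getName)
    }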
Re: trouble with sbt building network-* projects?
bq. I have to keep cd'ing into network/common, run mvn install, then go back to network/shuffle and run some other mvn command over there.

Yeah - been through this. Having continuous testing for maven would be nice.

On Fri, Feb 27, 2015 at 11:31 AM, Imran Rashid iras...@cloudera.com wrote:

well, perhaps I just need to learn to use maven better, but currently I find sbt much more convenient for continuously running my tests. I do use zinc, but I'm looking for continuous testing. This makes me think I need sbt for that: http://stackoverflow.com/questions/11347633/is-there-a-java-continuous-testing-plugin-for-maven

1) I really like that in sbt I can run ~test-only com.foo.bar.SomeTestSuite (or whatever other pattern) and just leave that running as I code, without having to go and explicitly trigger mvn test and wait for the result.

2) I find sbt's handling of sub-projects much simpler (when it works). I'm trying to make changes to network/common and network/shuffle, which means I have to keep cd'ing into network/common, run mvn install, then go back to network/shuffle and run some other mvn command over there. I don't want to run mvn at the root project level, b/c I don't want to wait for it to compile all the other projects when I just want to run tests in network/common. Even with incremental compilation, in my day-to-day coding I want to entirely skip compiling sql, graphx, mllib, etc. -- I have to switch branches often enough that I end up triggering a full rebuild of those projects even when I haven't touched them.

On Fri, Feb 27, 2015 at 1:14 PM, Ted Yu yuzhih...@gmail.com wrote:

bq. to be able to run my tests in sbt, though, it makes the development iterations much faster.

Was the preference for sbt due to long maven build times? Have you started Zinc on your machine?

Cheers

On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid iras...@cloudera.com wrote:

Has anyone else noticed very strange build behavior in the network-* projects? maven seems to be doing the right thing, but sbt is very inconsistent. Sometimes when it builds network-shuffle it doesn't know about any of the code in network-common. Sometimes it will completely skip the java unit tests, and then some time later it'll suddenly decide it knows about some more of the java unit tests. It's not from a simple change, like touching a test file, or a file the test depends on -- nor a restart of sbt. I am pretty confused.

maven had issues when I tried to add scala code to network-common: it would compile the scala code but not make it available to java. I'm working around that by just coding in java anyhow. I'd really like to be able to run my tests in sbt, though; it makes the development iterations much faster.

thanks,
Imran
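[For what it's worth, Maven's reactor flags can at least avoid the cd'ing between modules described above; a single invocation like the following should build network/common and then run the network/shuffle tests. Module paths are assumed from the directory layout, and this is not a substitute for continuous testing.]

    # -pl selects the module to build; -am (--also-make) first builds the
    # modules it depends on, such as network/common
    mvn -pl network/shuffle -am test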