Nathan Bell wrote:
> On Thu, Feb 12, 2009 at 8:19 AM, Michael Abshoff
> <michael.absh...@googlemail.com> wrote:
Hi,

>> Not even close. The current generation peaks at around 1.2 TFlops single
>> precision, 280 GFlops double precision for ATI's hardware. The main
>> problem with those numbers is that the memory on the graphics card
>> cannot feed the data fast enough into the GPU to achieve theoretical
>> peak. So those hundreds of GFlops are pure marketing :)
>
> If your application is memory bandwidth limited, then yes, you're not
> likely to see 100s of GFlops anytime soon. However, compute limited
> applications can and do achieve 100s of GFlops on GPUs. Basic
> operations like FFTs and (level 3) BLAS are compute limited, as are
> the following applications:
>
> http://www.ks.uiuc.edu/Research/gpu/
> http://www.dam.brown.edu/scicomp/scg-media/report_files/BrownSC-2008-27.pdf

Yes, certainly. But Sturla implied that some "random consumer GPU" (to put a negative spin on it :) could do the above. There also seems to be a huge expectation that "porting your code to the GPU" will make it 10 to 100 times faster. There are cases like that, as mentioned above, but this only applies to a subset of problems. Another problem is RAM: for many datasets I work with, 512 to 1024 MB just isn't cutting it. That means Tesla cards at $1k and upward, and all of a sudden we are playing a different game.

Nine months ago, when we started playing with CUDA, we took a MacBook Pro with a decent NVidia card and laughed hard after it became clear that its Core2 with either ATLAS or the Accelerate framework (which is more or less ATLAS for its BLAS bits) was faster than the built-in NVidia card in either single or double precision. Sure, that is a consumer-level laptop GPU, but I did expect more.

>> So in reality you might get anywhere from 20% to 60% (if you are lucky)
>> locally before accounting for transfers from main memory to GPU memory
>> and so on. Given that recent Intel CPUs give you about 7 to 11 GFlops
>> double precision per core, and libraries like ATLAS give you that
>> performance today without the need to jump through hoops, these numbers
>> start to look a lot less impressive.
>
> You neglect to mention that CPUs, which have roughly 1/10th the memory
> bandwidth of high-end GPUs, are memory bound on the very same
> problems. You will not see 7 to 11 GFlops on a memory bound CPU code
> for the same reason you argue that GPUs don't achieve 100s of GFlops
> on memory bound GPU codes.

I am seeing 7 to 11 GFlops per core for matrix-matrix multiplies on Intel CPUs using Strassen, and we did scale linearly on 16-core Opterons as well as a 64-core Itanium box using ATLAS for the BLAS level 3 matrix-matrix multiply. Once you have multiple GPUs you no longer have a shared-memory architecture (AFAIK the 4-GPU boxen sold by NVidia have fast buses between the cards, but they aren't ccNUMA or anything like that - please correct me if I am wrong).

> In severely memory bound applications like sparse matrix-vector
> multiplication (i.e. A*x for sparse A) the best performance you can
> expect is ~10 GFlops on the GPU and ~1 GFlop on the CPU (in double
> precision). We discuss this problem in the following tech report:
> http://forums.nvidia.com/index.php?showtopic=83825

Ok, I care about dense operations primarily, but it is interesting to see that the GPU fares well on sparse LA.

> It's true that host<->device transfers can be a bottleneck. In many
> cases, the solution is to simply leave the data resident on the GPU.

Well, that assumes you have enough memory locally for your working set.
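When everything does fit, keeping it resident is straightforward; PyCUDA's gpuarray, for instance, already gives you a NumPy-ish array whose buffer lives on the card and is only copied back when you ask for it. A rough sketch, assuming PyCUDA is installed (sizes and operations made up for illustration):

    import numpy
    import pycuda.autoinit             # sets up a CUDA context on the default device
    import pycuda.gpuarray as gpuarray

    a = numpy.random.randn(1024, 1024).astype(numpy.float32)
    a_gpu = gpuarray.to_gpu(a)         # one host -> device copy

    # the elementwise work below stays on the card, no PCIe traffic
    b_gpu = 2 * a_gpu + 1
    c_gpu = b_gpu * b_gpu

    result = c_gpu.get()               # single device -> host copy at the end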
And if not, you need to be clever about caching, and I did not see anything in CUDA that takes care of that job for you. I have seen libraries like libflame that claim to do it, but I have not played with them yet.

> For instance, you could imagine a variant of ndarray that held a
> pointer to a device array. Of course this requires that the other
> expensive parts of your algorithm also execute on the GPU so you're
> not shuttling data over the PCIe bus all the time.

Absolutely. I think that GPUs can fill a large niche for scientific computation, but the GPU is not (yet?) the general purpose CPU it is sometimes made out to be.

> Full Disclosure: I'm a researcher at NVIDIA

Cool. Thanks for the links, by the way.

As I mentioned, we have bought Tesla hardware and are working on getting our code to use GPUs for numerical linear algebra, exact linear algebra, and shortly also things like Monte Carlo simulation. I do think the GPU is extremely useful for much of the above, but there are plenty of programming issues to resolve and a lot of infrastructure code to be written before GPU computing becomes ubiquitous. After the last new thing I had put my hopes in (the Cell CPU) basically turned out to be a dud, I am hesitant about anything until the code I am running actually sees the benefit.

The thing I am unhappy about with NVidia is that CUDA is not free as in freedom. I am not an FSF zealot, so I will not try to convince anyone to make their software free. Given a choice between OpenCL and CUDA, you have the lead at the moment because you have actually been shipping a working product for more than a year, but I am not so sure that in the long term OpenCL won't win people's mindshare. If you look at the history of 3D acceleration, we started with numerous APIs that were all supplanted by OpenGL, which then got pushed aside by DirectX. Anyway, no point in ranting here any more ;)

Cheers,

Michael
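P.S. For anyone who wants to sanity-check the dense matrix-matrix numbers on their own box, a quick-and-dirty timing along these lines is enough (size picked arbitrarily; 2*n^3 is the conventional flop count for the classical algorithm, even if the BLAS underneath does something cleverer):

    import time
    import numpy

    n = 2000
    A = numpy.random.rand(n, n)
    B = numpy.random.rand(n, n)

    t0 = time.time()
    C = numpy.dot(A, B)                # dispatches to whatever BLAS numpy was built against
    elapsed = time.time() - t0

    # conventional 2*n^3 flop count for an n x n dense matrix-matrix multiply
    print("%.1f GFlop/s" % (2.0 * n ** 3 / elapsed / 1e9))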